## Speech Emotion Recognition: Audio Classification

Dataset Source: https://www.kaggle.com/datasets/dmitrybabko/speech-emotion-recognition-en

### Prepare

#### Install Missing Libraries

#### Import Necessary Libraries

In [1]:
import os, sys, random, glob
os.environ['TOKENIZERS_PARALLELISM']='false'

import numpy as np
import pandas as pd

import datasets
from datasets import load_dataset, Audio, DatasetDict, Features, ClassLabel

import torch

import transformers
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification
from transformers import TrainingArguments, Trainer

import evaluate

from IPython.display import display

# !git lfs install

In [2]:
# Check that MPS is available
if not torch.backends.mps.is_available():
    if not torch.backends.mps.is_built():
        print("MPS not available because the current PyTorch install was not "
              "built with MPS enabled.")
    else:
        print("MPS not available because the current MacOS version is not 12.3+ "
              "and/or you do not have an MPS-enabled device on this machine.")

else:
    mps_device = torch.device("mps")

#### Access to HuggingFace Hub

In [25]:
# !huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) Y
Token is valid (permission: write).
Cannot authenticate through 

#### Mount Google Drive

#### Display Library Versions

In [3]:
print("Python:".rjust(15), sys.version[0:6])
print("NumPy:".rjust(15), np.__version__)
print("Pandas:".rjust(15), pd.__version__)
print("Datasets:".rjust(15), datasets.__version__)
print("Torch:".rjust(15), torch.__version__)
print("Transformers:".rjust(15), transformers.__version__)
print("Evaluate:".rjust(15), evaluate.__version__)

        Python: 3.11.8
         NumPy: 1.26.4
        Pandas: 2.2.1
      Datasets: 2.18.0
         Torch: 2.2.1
  Transformers: 4.39.0
      Evaluate: 0.4.1


### Load Data

#### Create Dictionaries to Convert Labels Between Strings & Integers

In [4]:
labels = ["SAD",
          "ANGRY",
          "DISGUST",
          "FEAR",
          "HAPPY",
          "NEUTRAL"]


NUM_OF_LABELS = len(labels)

label2id, id2label = dict(), dict()

for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

initial_label_update = {"SAD": "SAD",
                        "ANG": "ANGRY",
                        "DIS": "DISGUST",
                        "FEA": "FEAR",
                        "HAP": "HAPPY",
                        "NEU": "NEUTRAL"}


print(labels)
print(NUM_OF_LABELS)
print(label2id)
print(id2label)

['SAD', 'ANGRY', 'DISGUST', 'FEAR', 'HAPPY', 'NEUTRAL']
6
{'SAD': '0', 'ANGRY': '1', 'DISGUST': '2', 'FEAR': '3', 'HAPPY': '4', 'NEUTRAL': '5'}
{'0': 'SAD', '1': 'ANGRY', '2': 'DISGUST', '3': 'FEAR', '4': 'HAPPY', '5': 'NEUTRAL'}
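
As a quick sanity check, the two dictionaries should be exact inverses of each other. The sketch below rebuilds them the same way as the cell above and verifies the round trip. The helper name `build_label_maps` is an assumption introduced for illustration only.

```python
def build_label_maps(labels):
    """Return (label2id, id2label) with string ids, as in the cell above."""
    label2id = {label: str(i) for i, label in enumerate(labels)}
    id2label = {str(i): label for i, label in enumerate(labels)}
    return label2id, id2label


labels = ["SAD", "ANGRY", "DISGUST", "FEAR", "HAPPY", "NEUTRAL"]
label2id, id2label = build_label_maps(labels)

# Every label should survive a label -> id -> label round trip
round_trip_ok = all(id2label[label2id[label]] == label for label in labels)
print(round_trip_ok)
```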


#### Prepare Metadata File

In [5]:
parent_dir = "./dataset/CREMA-D"

dir_path = os.path.join(parent_dir, "*.wav")

files_and_name = glob.glob(dir_path)

metadata = pd.DataFrame(files_and_name, columns=["file_path"])

metadata['file_name'] = metadata['file_path'].apply(os.path.basename)

# CREMA-D file names look like "1022_ITS_ANG_XX.wav"; the emotion code is the second-to-last field
metadata['label'] = metadata['file_name'].apply(lambda x: x.split("_")[-2])
metadata['label'] = metadata['label'].replace(initial_label_update)
metadata['label'] = metadata['label'].replace(label2id)

metadata = metadata.drop(columns=["file_path"])

metadata_file_location = os.path.join(parent_dir, "metadata.csv")
metadata.to_csv(metadata_file_location, index=False)

metadata.head()

| | file_name | label |
|---|---|---|
| 0 | 1022_ITS_ANG_XX.wav | 1 |
| 1 | 1037_ITS_ANG_XX.wav | 1 |
| 2 | 1060_ITS_NEU_XX.wav | 5 |
| 3 | 1075_ITS_NEU_XX.wav | 5 |
| 4 | 1073_IOM_DIS_XX.wav | 2 |
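
The label derivation used for the metadata file can be checked on a single path without pandas. The following is a minimal sketch, assuming file names always carry the emotion code as the second-to-last underscore-separated field. The function name `label_id_from_path` and the constants are hypothetical, and the ids are plain integers here rather than the strings used in `label2id`.

```python
import os

# Three-letter CREMA-D codes mapped to the label names used in this notebook
CODE_TO_LABEL = {"SAD": "SAD", "ANG": "ANGRY", "DIS": "DISGUST",
                 "FEA": "FEAR", "HAP": "HAPPY", "NEU": "NEUTRAL"}
LABELS = ["SAD", "ANGRY", "DISGUST", "FEAR", "HAPPY", "NEUTRAL"]


def label_id_from_path(path):
    """Map a CREMA-D file path to its integer class id."""
    file_name = os.path.basename(path)        # e.g. "1073_IOM_DIS_XX.wav"
    emotion_code = file_name.split("_")[-2]   # second-to-last field, e.g. "DIS"
    return LABELS.index(CODE_TO_LABEL[emotion_code])


print(label_id_from_path("./dataset/CREMA-D/1073_IOM_DIS_XX.wav"))
```

An unknown emotion code raises a `KeyError` in this sketch, which makes unexpected file names visible early instead of silently producing a wrong label.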


#### Ingest & Preprocess Dataset

In [6]:
audio_data = load_dataset(parent_dir)
audio_data['train'][10]

{'audio': {'path': '/Users/bianca/Library/CloudStorage/OneDrive-SharedLibraries-NationalUniversityofSingapore/Capstone Scoping - 2. data/dataset/CREMA-D/1001_IEO_DIS_LO.wav',
  'array': array([-0.00228882, -0.00204468, -0.00180054, ...,  0.        ,
          0.        ,  0.        ]),
  'sampling_rate': 16000},
 'label': 2}

#### Cast Audio Feature to Data Type of Audio

In [7]:
audio_data = audio_data.cast_column("audio", Audio(sampling_rate=16000))

#### Split Dataset into Training & Testing Datasets

In [8]:
audio_data = audio_data.shuffle(seed=42)

# Pass a seed here as well so that the train/eval split is reproducible
audio_data_split = audio_data['train'].train_test_split(test_size=0.25, seed=42)

ds = DatasetDict({
    'train' : audio_data_split['train'],
    'eval' : audio_data_split['test']
})
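
The split sizes reported further down (5581 training rows and 1861 evaluation rows) follow from `test_size=0.25`. The arithmetic can be sketched as below, assuming the test portion is rounded up and the training portion takes the remainder, which is the scikit-learn convention. `split_sizes` is a hypothetical helper, not part of the `datasets` API.

```python
import math


def split_sizes(n_samples, test_size):
    """Sizes of the (train, test) parts for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)   # assumption: test part rounds up
    n_train = n_samples - n_test                # training part takes the rest
    return n_train, n_test


# Total number of rows across the two splits in this notebook
n_total = 5581 + 1861
print(split_sizes(n_total, 0.25))
```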

#### Some Information About Training & Validation Datasets

In [9]:
print("Training Dataset")
print("Training Dataset Info: ", ds['train'])
print("First Sample in Training Dataset", ds['train'][0])
print("Last Sample in Training Dataset", ds['train'][-1])
print("Unique Values in Label/Class: ", sorted(ds['train'].unique("label")))

print("\n\nEvaluation Dataset")
print("Evaluation Dataset Info: ", ds['eval'])
print("First Sample in Evaluation Dataset", ds['eval'][0])
print("Last Sample in Evaluation Dataset", ds['eval'][-1])
print("Unique Values in Label/Class: ", sorted(ds['eval'].unique("label")))

Training Dataset
Training Dataset Info:  Dataset({
    features: ['audio', 'label'],
    num_rows: 5581
})
First Sample in Training Dataset {'audio': {'path': '/Users/bianca/Library/CloudStorage/OneDrive-SharedLibraries-NationalUniversityofSingapore/Capstone Scoping - 2. data/dataset/CREMA-D/1028_ITS_HAP_XX.wav', 'array': array([1.11694336e-02, 1.01928711e-02, 1.03149414e-02, ...,
       6.10351562e-05, 6.10351562e-05, 6.10351562e-05]), 'sampling_rate': 16000}, 'label': 4}
Last Sample in Training Dataset {'audio': {'path': '/Users/bianca/Library/CloudStorage/OneDrive-SharedLibraries-NationalUniversityofSingapore/Capstone Scoping - 2. data/dataset/CREMA-D/1046_ITS_SAD_XX.wav', 'array': array([-0.00231934, -0.00296021, -0.00350952, ...,  0.        ,
        0.        ,  0.        ]), 'sampling_rate': 16000}, 'label': 0}


Unique Values in Label/Class:  [0, 1, 2, 3, 4, 5]


Evaluation Dataset
Evaluation Dataset Info:  Dataset({
    features: ['audio', 'label'],
    num_rows: 1861
})
First Sample in Evaluation Dataset {'audio': {'path': '/Users/bianca/Library/CloudStorage/OneDrive-SharedLibraries-NationalUniversityofSingapore/Capstone Scoping - 2. data/dataset/CREMA-D/1010_TSI_ANG_XX.wav', 'array': array([ 0.0005188 , -0.00033569, -0.00048828, ...,  0.        ,
        0.        ,  0.        ]), 'sampling_rate': 16000}, 'label': 1}
Last Sample in Evaluation Dataset {'audio': {'path': '/Users/bianca/Library/CloudStorage/OneDrive-SharedLibraries-NationalUniversityofSingapore/Capstone Scoping - 2. data/dataset/CREMA-D/1057_IWW_ANG_XX.wav', 'array': array([5.52368164e-03, 5.06591797e-03, 4.51660156e-03, ...,
       0.00000000e+00, 0.00000000e+00, 3.05175781e-05]), 'sampling_rate': 16000}, 'label': 1}


Unique Values in Label/Class:  [0, 1, 2, 3, 4, 5]


#### Display Some Examples with Ability to Listen to Them

In [10]:
# Import under an alias so that datasets.Audio (imported earlier) is not shadowed
from IPython.display import Audio as IPythonAudio, display

for _ in range(5):
    rand_idx = random.randint(0, len(ds["train"])-1)
    example = ds["train"][rand_idx]
    audio = example["audio"]

    print(f'Label: {id2label[str(example["label"])]}')
    print(f'Shape: {audio["array"].shape}, sampling rate: {audio["sampling_rate"]}')
    display(IPythonAudio(audio["array"], rate=audio["sampling_rate"]))
    print()

Label: SAD
Shape: (47514,), sampling rate: 16000



Label: DISGUST
Shape: (50183,), sampling rate: 16000



Label: ANGRY
Shape: (43777,), sampling rate: 16000



Label: FEAR
Shape: (38972,), sampling rate: 16000



Label: ANGRY
Shape: (46446,), sampling rate: 16000





### Model


#### Basic Values/Constants

In [11]:
MODEL_CKPT = "facebook/wav2vec2-base"
MODEL_NAME = MODEL_CKPT.split("/")[-1] + "-Speech_Emotion_Recognition"

NUM_OF_EPOCHS = 10
LEARNING_RATE = 3e-5

BATCH_SIZE = 64 # a batch size of 32 used about 7.2 GB of GPU memory
STRATEGY = "epoch"

#### Set Sample Rate

In [12]:
sampling_rate = ds["train"].features["audio"].sampling_rate
sampling_rate

16000

#### Instantiate Instance of Feature Extractor

In [13]:
feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_CKPT)



#### Define Function to Preprocess Data

In [14]:
def preprocess_function(examples):
    '''
    This function prepares the dataset for the transformer
    by applying the feature extractor to it (among other
    processes).
    '''
    max_duration = 3.0 # seconds
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(audio_arrays,
                               sampling_rate=feature_extractor.sampling_rate,
                               max_length=int(feature_extractor.sampling_rate * max_duration),
                               truncation=True)
    return inputs

encoded_audio = ds.map(preprocess_function, remove_columns="audio",
                       batch_size=100,batched=True,num_proc=4)
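
With `max_duration = 3.0` and a 16 kHz sampling rate, `max_length` works out to 48000 samples. The effect of truncation can be illustrated on the clip lengths printed in the listening examples above. This is a simplified stand-in for what the feature extractor does, not its actual implementation; `truncate_to_duration` is a hypothetical helper. Since no padding argument is passed in `preprocess_function`, shorter clips are assumed to keep their original length at this stage.

```python
def truncate_to_duration(samples, sampling_rate, max_duration):
    """Keep at most max_duration seconds of audio."""
    max_length = int(sampling_rate * max_duration)
    return samples[:max_length]


sampling_rate = 16000
max_duration = 3.0  # seconds, as in preprocess_function

# Clip lengths taken from the listening examples earlier in the notebook
for n_samples in [47514, 50183, 43777, 38972, 46446]:
    clip = [0.0] * n_samples   # stand-in for real audio samples
    kept = truncate_to_duration(clip, sampling_rate, max_duration)
    print(f"{n_samples / sampling_rate:.2f} s -> {len(kept)} samples kept")
```

Only the 50183-sample clip is longer than three seconds, so it is the only one of the five that loses samples.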

#### Define Metrics Evaluation Function

In [15]:
def compute_metrics(p):
    '''
    This function calculates & returns the following metrics:
    - accuracy
    - f1 score
    - recall
    - precision
    F1, recall & precision are each reported with weighted,
    micro & macro averaging.
    '''
    # Convert logits to predicted class ids once & reuse them below
    predictions = np.argmax(p.predictions, axis=1)
    references = p.label_ids

    accuracy_metric = evaluate.load("accuracy")
    results = {"accuracy": accuracy_metric.compute(predictions=predictions,
                                                   references=references)["accuracy"]}

    # (metric id passed to evaluate.load, name used in the returned keys)
    for metric_id, display_name in [("f1", "F1"),
                                    ("recall", "Recall"),
                                    ("precision", "Precision")]:
        metric = evaluate.load(metric_id)
        for average in ["weighted", "micro", "macro"]:
            score = metric.compute(predictions=predictions,
                                   references=references,
                                   average=average)[metric_id]
            results[f"{average.capitalize()} {display_name}"] = score

    return results
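
The three averaging modes reported by `compute_metrics` differ only in how per-class scores are combined. The following pure-Python sketch shows this for F1 on a toy example; the data and the helper names `per_class_counts` and `f1` are made up for illustration and are not how the `evaluate` library computes the scores internally.

```python
def per_class_counts(references, predictions, num_classes):
    """Return per-class (true positives, false positives, false negatives)."""
    counts = []
    for c in range(num_classes):
        tp = sum(1 for r, p in zip(references, predictions) if r == c and p == c)
        fp = sum(1 for r, p in zip(references, predictions) if r != c and p == c)
        fn = sum(1 for r, p in zip(references, predictions) if r == c and p != c)
        counts.append((tp, fp, fn))
    return counts


def f1(tp, fp, fn):
    """F1 score from raw counts; 0.0 when the class never occurs."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Toy labels, not taken from the dataset
references = [0, 0, 0, 0, 1, 1, 2, 2]
predictions = [0, 0, 0, 1, 1, 2, 2, 2]
counts = per_class_counts(references, predictions, num_classes=3)

# macro: unweighted mean of the per-class scores
macro_f1 = sum(f1(*c) for c in counts) / len(counts)

# weighted: per-class scores weighted by class support (tp + fn)
weighted_f1 = sum(f1(*c) * (c[0] + c[2]) for c in counts) / len(references)

# micro: pool the counts over all classes, then score once
micro_f1 = f1(*[sum(column) for column in zip(*counts)])

accuracy = sum(r == p for r, p in zip(references, predictions)) / len(references)
print(macro_f1, weighted_f1, micro_f1, accuracy)
```

In single-label multiclass classification the micro-averaged score equals accuracy, which is why the micro F1, micro recall and micro precision values in the evaluation output match `eval_accuracy` exactly.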

#### Instantiate Model

In [16]:
model = AutoModelForAudioClassification.from_pretrained(MODEL_CKPT,
                                                        num_labels=NUM_OF_LABELS,
                                                        label2id=label2id,
                                                        id2label= id2label)

Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at facebook/wav2vec2-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'projector.bias', 'projector.weight', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### Define Training Arguments

In [17]:
args = TrainingArguments(
    output_dir=MODEL_NAME,
    evaluation_strategy=STRATEGY,
    num_train_epochs=NUM_OF_EPOCHS,
    save_strategy=STRATEGY,
    logging_strategy="steps",
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    warmup_ratio=0.10,
    gradient_accumulation_steps=4,
    logging_first_step=True,
    report_to="tensorboard",
    hub_private_repo=True,
    # use_mps_device=True,
    gradient_checkpointing=True
)
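
These arguments imply an effective batch size and a training schedule that are worth working out by hand. The sketch below approximates how the step counts come together, assuming a single device and the values defined in this notebook. The exact numbers reported by `Trainer` may differ between library versions or when a different batch size is used.

```python
import math

n_train = 5581            # rows in the training split
batch_size = 64           # BATCH_SIZE
accumulation_steps = 4    # gradient_accumulation_steps
num_epochs = 10           # NUM_OF_EPOCHS
warmup_ratio = 0.10

# Samples contributing to each optimizer update (single device assumed)
effective_batch_size = batch_size * accumulation_steps

# Batches per epoch, then optimizer updates per epoch
batches_per_epoch = math.ceil(n_train / batch_size)
updates_per_epoch = max(batches_per_epoch // accumulation_steps, 1)

total_updates = num_epochs * updates_per_epoch
warmup_updates = math.ceil(total_updates * warmup_ratio)

print(effective_batch_size, updates_per_epoch, total_updates, warmup_updates)
```

Gradient accumulation trades wall-clock time for memory: each optimizer update sees 256 samples while only 64 are held on the device at once.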



#### Define Trainer

In [18]:
MODEL_NAME

'wav2vec2-base-Speech_Emotion_Recognition'

In [19]:
trainer = Trainer(
    model = model,
    args = args,
    train_dataset = encoded_audio["train"],
    eval_dataset = encoded_audio["eval"],
    tokenizer = feature_extractor,
    compute_metrics = compute_metrics,
)



### Training

#### Train Model

In [None]:
trainer.train()

#### Evaluate Model

In [26]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 1861
  Batch size = 32


{'eval_loss': 0.8106924891471863,
 'eval_accuracy': 0.7281031703385277,
 'eval_Weighted F1': 0.7247543780750472,
 'eval_Micro F1': 0.7281031703385277,
 'eval_Macro F1': 0.7268519957485492,
 'eval_Weighted Recall': 0.7281031703385277,
 'eval_Micro Recall': 0.7281031703385277,
 'eval_Macro Recall': 0.7310833557439055,
 'eval_Weighted Precision': 0.7319188411210771,
 'eval_Micro Precision': 0.7281031703385277,
 'eval_Macro Precision': 0.732869407033253,
 'eval_runtime': 83.3066,
 'eval_samples_per_second': 22.339,
 'eval_steps_per_second': 0.708,
 'epoch': 9.98}
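
The throughput figures in this output can be reproduced from the example count, the runtime and the logged batch size. The check below is a small worked calculation using only numbers printed above. Note that the log reports a batch size of 32, whereas `BATCH_SIZE` is set to 64 earlier, so this run presumably used the smaller value.

```python
import math

num_examples = 1861
eval_runtime = 83.3066     # seconds, from 'eval_runtime'
eval_batch_size = 32       # from the "Batch size = 32" log line

samples_per_second = num_examples / eval_runtime
eval_steps = math.ceil(num_examples / eval_batch_size)
steps_per_second = eval_steps / eval_runtime

print(round(samples_per_second, 3), eval_steps, round(steps_per_second, 3))
```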

#### Push Model to Hub (My Profile!!!)

In [34]:
trainer.model.push_to_hub('BiancaZYCao/wav2vec2-base-Speech_Emotion_Recognition')

Configuration saved in wav2vec2-base-Speech_Emotion_Recognition/config.json
Model weights saved in wav2vec2-base-Speech_Emotion_Recognition/pytorch_model.bin
Uploading the following files to BiancaZYCao/wav2vec2-base-Speech_Emotion_Recognition: pytorch_model.bin,config.json


CommitInfo(commit_url='https://huggingface.co/BiancaZYCao/wav2vec2-base-Speech_Emotion_Recognition/commit/1dda97df49119df844a25b16b2f23d309691be08', commit_message='Upload Wav2Vec2ForSequenceClassification', commit_description='', oid='1dda97df49119df844a25b16b2f23d309691be08', pr_url=None, pr_revision=None, pr_num=None)

In [31]:
trainer.save_model('/content/drive/MyDrive/models/')

Saving model checkpoint to /content/drive/MyDrive/models/
Configuration saved in /content/drive/MyDrive/models/config.json
Model weights saved in /content/drive/MyDrive/models/pytorch_model.bin
Feature extractor saved in /content/drive/MyDrive/models/preprocessor_config.json
Saving model checkpoint to wav2vec2-base-Speech_Emotion_Recognition
Configuration saved in wav2vec2-base-Speech_Emotion_Recognition/config.json
Model weights saved in wav2vec2-base-Speech_Emotion_Recognition/pytorch_model.bin
Feature extractor saved in wav2vec2-base-Speech_Emotion_Recognition/preprocessor_config.json
Dropping the following result as it does not have all the necessary fields:
{'dataset': {'name': 'audiofolder', 'type': 'audiofolder', 'config': 'default', 'split': 'None', 'args': 'default'}, 'metrics': [{'name': 'Accuracy', 'type': 'accuracy', 'value': 0.7281031703385277}]}


### Notes & Other Takeaways From This Project
****
- Results:
    - Accuracy: 0.753895754970446
    - Weighted F1: 0.7514267219693074
    - Weighted Recall: 0.753895754970446
    - Weighted Precision: 0.7564957587826933
****