# Emotion Recognition in Greek Speech Using Wav2Vec 2.0

**Wav2Vec 2.0** is a pretrained model for Automatic Speech Recognition (ASR), released in [September 2020](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/) by Alexei Baevski, Michael Auli, and Alex Conneau. Soon after the superior performance of Wav2Vec2 was demonstrated on the English ASR dataset LibriSpeech, *Facebook AI* presented [XLSR-Wav2Vec2](https://arxiv.org/abs/2006.13979). XLSR stands for *cross-lingual speech representations* and refers to XLSR-Wav2Vec2's ability to learn speech representations that are useful across multiple languages.

Similar to Wav2Vec2, XLSR-Wav2Vec2 learns powerful speech representations from tens of thousands of hours of unlabeled speech in more than 50 languages. Similar to [BERT's masked language modeling](http://jalammar.github.io/illustrated-bert/), the model learns contextualized speech representations by randomly masking feature vectors before passing them to a transformer network.

![wav2vec2_structure](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/xlsr_wav2vec2.png)

The authors show for the first time that massively pretraining an ASR model on cross-lingual unlabeled speech data, followed by language-specific fine-tuning on very little labeled data, achieves state-of-the-art results. See Tables 1-5 of the official [paper](https://arxiv.org/pdf/2006.13979.pdf).

During the fine-tuning week hosted by HuggingFace, more than 300 people fine-tuned the pretrained XLSR-Wav2Vec2 model on low-resource ASR datasets for more than 50 languages. The model is fine-tuned using [Connectionist Temporal Classification](https://distill.pub/2017/ctc/) (CTC), an algorithm used to train neural networks for sequence-to-sequence problems, mainly in Automatic Speech Recognition and handwriting recognition. Follow this [notebook](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Fine_Tune_XLSR_Wav2Vec2_on_Turkish_ASR_with_%F0%9F%A4%97_Transformers.ipynb#scrollTo=Gx9OdDYrCtQ1) for more information about XLSR-Wav2Vec2 fine-tuning.

This model has shown strong results in many low-resource languages. You can see the [competition board](https://paperswithcode.com/dataset/common-voice) or even test the models from the [HuggingFace hub](https://huggingface.co/models?filter=xlsr-fine-tuning-week).


In this notebook, we will go through how to use this model to recognize the emotional aspects of speech in a language (or, more generally, how to use it for any audio classification problem). Before going any further, we need to install some handy packages and define some environment variables.

In [1]:
%%capture
%env LC_ALL=C.UTF-8
%env LANG=C.UTF-8
%env TRANSFORMERS_CACHE=content/cache
%env HF_DATASETS_CACHE=content/cache
%env CUDA_LAUNCH_BLOCKING=1

## Prepare Data

For this particular example, we use [Acted Emotional Speech Dynamic Database – AESDD](http://m3c.web.auth.gr/research/aesdd-speech-emotion-recognition/) provided by Multidisciplinary Media & Mediated Communication Research Group ([M3C](http://m3c.web.auth.gr/)). 

The Acted Emotional Speech Dynamic Database (AESDD) is a publicly available speech emotion recognition dataset that contains utterances of acted emotional speech in the Greek language for five different emotions: `sadness`, `disgust`, `happiness`, `anger`, and `fear`.

The dataset is organized into one directory per emotion, and each directory contains the recordings for that emotion. We need to loop over the directories and collect the file paths for each class based on the directory name.

```bash
.
├── Tools\ and\ Documentation
│   ├── ESTrainer.mlapp
│   ├── Speech\ Emotion\ Recognition\ Adapted\ to\ Multimodal\ Semantic\ Repositories_documentation.pdf
│   ├── Speech\ Emotion\ Recognition\ for\ Performance\ Interaction.pdf
│   └── readme.txt
├── anger
│   ├── a01\ (1).wav
│   ├── a01\ (2).wav
│   ├── ...
├── disgust
│   ├── d01\ (1).wav
│   ├── d01\ (2).wav
│   ├── ...
├── fear
│   ├── f01\ (1).wav
│   ├── f01\ (2).wav
│   ├── ...
├── happiness
│   ├── h01\ (1).wav
│   ├── h01\ (2).wav
│   ├── ...
└── sadness
    ├── s01\ (1).wav
    ├── s01\ (2).wav
    ├── ...

6 directories, 609 files
```

In [2]:
import numpy as np
import pandas as pd
import torch

from pathlib import Path
from tqdm import tqdm

import torchaudio
from sklearn.model_selection import train_test_split

import os
import sys

In [3]:
data = []

for path in tqdm(Path("/content/data/aesdd").glob("**/*.wav")):
    name = path.name.split('.')[0]   # file name without its extension
    label = path.parent.name         # the directory name is the emotion label

    try:
        # Some files are broken; loading them raises, so they are skipped.
        torchaudio.load(path)
        data.append({
            "name": name,
            "path": path,
            "emotion": label
        })
    except Exception:
        # print(str(path))
        pass



In [4]:
# Short emotion codes used in the stimulus file names, mapped to the TESS label names.
EMOTION_CODES = {
    'ang': 'angry',
    'dis': 'disgust',
    'fea': 'fear',
    'hap': 'happy',
    'sad': 'sad',
    'sur': 'ps',  # TESS labels surprise as "ps"
}


def load_custom_dataset():
    paths = []
    emotions = []
    testpaths = []
    testlabels = []
    print(sys.executable)

    # TESS: the emotion is the last underscore-separated token of the file name.
    for dirname, _, filenames in os.walk('../tess'):
        for filename in filenames:
            label = filename.split('_')[-1].split('.')[0]
            if label != 'neutral':
                emotions.append(label.lower())
                paths.append(os.path.join(dirname, filename))

    # Intensity-morph stimuli: the emotion code is the second token of the file name.
    for dirname, _, filenames in os.walk('../Stimuli_Intensitätsmorphs'):
        for filename in filenames:
            emot = filename.split('_')[1]
            label = EMOTION_CODES.get(emot, emot)
            if emot != 'ple':  # skip the 'ple' stimuli
                testpaths.append(os.path.join(dirname, filename))
                testlabels.append(label.lower())

    print(testlabels)
    print(testpaths)
    print('Dataset is loaded')
    return paths, emotions, testpaths, testlabels


trainpaths, trainlabels, testpaths, testlabels = load_custom_dataset()

# Create dataframes for training and testing
trainDF = pd.DataFrame()
trainDF["path"] = trainpaths
trainDF["emotion"] = trainlabels

testDF = pd.DataFrame()
testDF["path"] = testpaths
testDF["emotion"] = testlabels


testDF


|     | path | emotion |
| --- | --- | --- |
| 0 | ../Stimuli_Intensitätsmorphs/nf02_ang_w05_o_10... | angry |
| 1 | ../Stimuli_Intensitätsmorphs/nm01_ang_w01_o_50... | angry |
| 2 | ../Stimuli_Intensitätsmorphs/nf04_sur_w03_c_10... | ps |
| 3 | ../Stimuli_Intensitätsmorphs/nf03_hap_w01_o_75... | happy |
| 4 | ../Stimuli_Intensitätsmorphs/nm01_fea_w01_o_50... | fear |
| ... | ... | ... |
| 763 | ../Stimuli_Intensitätsmorphs/nm02_sur_w05_o_75... | ps |
| 764 | ../Stimuli_Intensitätsmorphs/nm03_sad_w01_o_75... | sad |
| 765 | ../Stimuli_Intensitätsmorphs/nf03_sad_w03_o_50... | sad |
| 766 | ../Stimuli_Intensitätsmorphs/nm04_sad_w01_c_50... | sad |


In [5]:
df = testDF
df.head()

|     | path | emotion |
| --- | --- | --- |
| 0 | ../Stimuli_Intensitätsmorphs/nf02_ang_w05_o_10... | angry |
| 1 | ../Stimuli_Intensitätsmorphs/nm01_ang_w01_o_50... | angry |
| 2 | ../Stimuli_Intensitätsmorphs/nf04_sur_w03_c_10... | ps |
| 3 | ../Stimuli_Intensitätsmorphs/nf03_hap_w01_o_75... | happy |
| 4 | ../Stimuli_Intensitätsmorphs/nm01_fea_w01_o_50... | fear |


In [6]:
# Filter broken and non-existent paths

print(f"Step 0: {len(df)}")

df["status"] = df["path"].apply(lambda path: True if os.path.exists(path) else None)
df = df.dropna(subset=["status"])  # drop rows whose file is missing on disk
df = df.drop(columns="status")
print(f"Step 1: {len(df)}")

df = df.sample(frac=1)
df = df.reset_index(drop=True)
df.head()

Step 0: 768
Step 1: 768




|     | path | emotion |
| --- | --- | --- |
| 0 | ../Stimuli_Intensitätsmorphs/nm04_sad_w03_c_10... | sad |
| 1 | ../Stimuli_Intensitätsmorphs/nm01_fea_w05_c_10... | fear |
| 2 | ../Stimuli_Intensitätsmorphs/nf01_ang_w05_o_50... | angry |
| 3 | ../Stimuli_Intensitätsmorphs/nf04_sur_w02_o_50... | ps |
| 4 | ../Stimuli_Intensitätsmorphs/nf03_ang_w03_o_10... | angry |


Let's explore how many labels (emotions) the dataset contains and how they are distributed.

In [7]:
print("Labels: ", df["emotion"].unique())
print()
df.groupby("emotion").count()[["path"]]

Labels:  ['sad' 'fear' 'angry' 'ps' 'happy' 'disgust']



| emotion | path |
| --- | --- |
| angry | 128 |
| disgust | 128 |
| fear | 128 |
| happy | 128 |
| ps | 128 |
| sad | 128 |


Let's play a random sample from the dataset; run the cell a few times to get a feeling for the audio and its emotional label.

In [8]:
import torchaudio
import librosa
import IPython.display as ipd
import numpy as np

idx = np.random.randint(0, len(df))
sample = df.iloc[idx]
path = sample["path"]
label = sample["emotion"]


print(f"ID Location: {idx}")
print(f"      Label: {label}")
print()

speech, sr = torchaudio.load(path)
speech = speech[0].numpy().squeeze()
speech = librosa.resample(np.asarray(speech), orig_sr=sr, target_sr=16_000)
ipd.Audio(data=np.asarray(speech), autoplay=True, rate=16000)

ID Location: 571
      Label: fear





For training, we need to split the data into train and test sets; in this example, we hold out `20%` of the samples for the test set, stratified by emotion.

In [9]:
#save_path = "/content/data"
save_path="content/data"
train_df, test_df = train_test_split(df, test_size=0.2, random_state=101, stratify=df["emotion"])

train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

# Uncomment on the first run: the next section loads these two CSV files.
#train_df.to_csv(f"{save_path}/train.csv", sep="\t", encoding="utf-8", index=False)
#test_df.to_csv(f"{save_path}/test.csv", sep="\t", encoding="utf-8", index=False)


print(train_df.shape)
print(test_df.shape)

(614, 2)
(154, 2)
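
The two shapes above follow directly from the split ratio. The short check below reproduces them; it assumes scikit-learn's behaviour of rounding the test share up and giving the remainder to the training set, and the variable names are purely illustrative.

```python
import math

n_samples = 768     # rows in df after the path filter ("Step 1" above)
n_classes = 6       # angry, disgust, fear, happy, ps, sad
test_size = 0.2

# Assumption: the test share is rounded up and the rest goes to train.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)   # 614 154

# With 128 clips per emotion, stratification leaves 25 or 26 test clips per class.
clips_per_class = n_samples // n_classes
print(clips_per_class, round(clips_per_class * test_size, 1))
```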


## Prepare Data for Training

In [10]:
# Loading the created dataset using datasets
from datasets import load_dataset


data_files = {
    "train": "train.csv",
    "validation": "test.csv",
}

dataset = load_dataset("content/data/", data_files=data_files, delimiter="\t", )
train_dataset = dataset["train"]
eval_dataset = dataset["validation"]

# We need to specify the input and output column
input_column = "path"
output_column = "emotion"
# we need to distinguish the unique labels in our SER dataset
label_list = train_dataset.unique(output_column)
label_list.sort()  # Let's sort it for determinism
num_labels = len(label_list)
#print(f"A classification problem with {num_labels} classes: {label_list}")


To preprocess the audio for our classification model, we need to set up the relevant Wav2Vec2 assets for our language, in this case `lighteternal/wav2vec2-large-xlsr-53-greek`, fine-tuned by [Dimitris Papadopoulos](https://huggingface.co/lighteternal/wav2vec2-large-xlsr-53-greek). (The cell below keeps that checkpoint commented out and loads `jonatasgrosman/wav2vec2-large-xlsr-53-german` instead.) To handle context representations for audio of any length, we use a merge strategy (pooling mode) that reduces the 3D representations (batch, time, hidden size) to 2D representations (batch, hidden size).

There are three merge strategies: `mean`, `sum`, and `max`. In this example, we achieved better results with the `mean` approach. Next, we need to instantiate the config and the feature extractor from the pretrained checkpoint.
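
To make the three merge strategies concrete, here is a minimal NumPy sketch of pooling over the time axis. The function name `merge_hidden_states` and the toy shapes are assumptions for illustration; the sketch also ignores the attention mask, which a real implementation may want to respect so that padded frames do not affect the result.

```python
import numpy as np

def merge_hidden_states(hidden_states: np.ndarray, mode: str = "mean") -> np.ndarray:
    """Collapse the time axis of a (batch, time, hidden) array."""
    if mode == "mean":
        return hidden_states.mean(axis=1)
    if mode == "sum":
        return hidden_states.sum(axis=1)
    if mode == "max":
        return hidden_states.max(axis=1)
    raise ValueError(f"unknown pooling mode: {mode!r}")

# Toy input: 2 clips, 5 time steps, 4 hidden features.
hidden = np.arange(2 * 5 * 4, dtype=np.float32).reshape(2, 5, 4)
pooled = merge_hidden_states(hidden, mode="mean")
print(hidden.shape, "->", pooled.shape)  # (2, 5, 4) -> (2, 4)
```

Whatever the clip length, the pooled output has one fixed-size vector per clip, which is what a classification layer needs.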

In [11]:
from transformers import AutoConfig, Wav2Vec2Processor

#model_name_or_path = "lighteternal/wav2vec2-large-xlsr-53-greek"
model_name_or_path = "jonatasgrosman/wav2vec2-large-xlsr-53-german"
pooling_mode = "mean"


# config
config = AutoConfig.from_pretrained(
    model_name_or_path,
    num_labels=num_labels,
    label2id={label: i for i, label in enumerate(label_list)},
    id2label={i: label for i, label in enumerate(label_list)},
    finetuning_task="wav2vec2_clf",
)
setattr(config, 'pooling_mode', pooling_mode)

processor = Wav2Vec2Processor.from_pretrained(model_name_or_path, )
target_sampling_rate = processor.feature_extractor.sampling_rate
print(f"The target sampling rate: {target_sampling_rate}")



The target sampling rate: 16000


## Preprocess Data

So far, we have downloaded, loaded, and split the SER dataset into train and test sets. Then we instantiated the configuration for our pooling strategy so that context representations can be used in our SER classification problem. Now, we need to turn each audio path into context representation tensors and feed them into our classification model to determine the emotion in the speech.

Since the audio files are saved in the `.wav` format, they are easy to read with **[Librosa](https://librosa.org/doc/latest/index.html)** or similar libraries, but for generality we assume the files may also come in the `.mp3` format. We found that the **[Torchaudio](https://pytorch.org/audio/stable/index.html)** library works best for reading `.mp3` data.

An audio file usually stores both its values and the sampling rate with which the speech signal was digitized. We want to store both in the dataset and write a **map(...)** function accordingly. We also need to convert the string labels into integers for our specific task, in this case **single-label classification**; you may want to adapt this step for **regression** or even **multi-label classification**.
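
The label conversion can be sketched in a few lines. The helper below is a hypothetical stand-in for the `label_to_id` function that this notebook imports from its local `utils` module (whose source is not shown here); returning `-1` for an unknown label is an assumption of this sketch.

```python
def label_to_id(label: str, label_list: list) -> int:
    """Return the position of `label` in the sorted label list, or -1 if unknown."""
    return label_list.index(label) if label in label_list else -1

label_list = sorted(["sad", "fear", "angry", "ps", "happy", "disgust"])
print(label_list)                        # ['angry', 'disgust', 'fear', 'happy', 'ps', 'sad']
print(label_to_id("happy", label_list))  # 3
```

Sorting the label list first keeps the mapping stable between runs, so the same integer always refers to the same emotion.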

In [12]:
import utils.audio_dataset_utils as audioUtils

def preprocess_function(examples):
    speech_list = [audioUtils.speech_file_to_array_librosa(path, target_sampling_rate) for path in examples[input_column]]
    target_list = [audioUtils.label_to_id(label, label_list) for label in examples[output_column]]

    result = processor(speech_list, sampling_rate=target_sampling_rate)
    result["labels"] = list(target_list)

    return result


In [13]:
train_dataset = train_dataset.map(
    preprocess_function,
    batch_size=50,
    batched=True,
    #num_proc=4
)
eval_dataset = eval_dataset.map(
    preprocess_function,
    batch_size=50,
    batched=True,
    #num_proc=4
)



In [14]:
# idx = 0
# print(f"Training input_values: {train_dataset[idx]['input_values']}")
# print(f"Training attention_mask: {train_dataset[idx]['attention_mask']}")
# print(f"Training labels: {train_dataset[idx]['labels']} - {train_dataset[idx]['emotion']}")

Great, now we've successfully read all the audio files, resampled them to 16 kHz, and mapped each audio clip to its corresponding label.

## Model

Before diving into the training part, we need to build our classification model based on the merge strategy.

In [15]:
import network_models.w2v_emotion_model.model as KNNModel
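
The model module imported above is local to this project and its source is not shown. As a rough idea of what a classification head on top of pooled features might compute, here is a NumPy sketch of a forward pass. The two layers are named after the `classifier.dense` and `classifier.out_proj` weights that appear in the loading log further down; the `tanh` non-linearity, the toy sizes, and the random weights are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, num_labels = 8, 6   # toy sizes; the real encoder is much wider

# Random weights standing in for classifier.dense and classifier.out_proj.
dense_w = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
dense_b = np.zeros(hidden_size)
out_w = rng.normal(scale=0.1, size=(hidden_size, num_labels))
out_b = np.zeros(num_labels)

def classification_head(pooled: np.ndarray) -> np.ndarray:
    """Map pooled features (batch, hidden) to logits (batch, num_labels)."""
    x = np.tanh(pooled @ dense_w + dense_b)   # assumed non-linearity
    return x @ out_w + out_b

pooled = rng.normal(size=(2, hidden_size))    # e.g. the result of mean pooling
logits = classification_head(pooled)
print(logits.shape)  # (2, 6)
```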

## Training

The data is processed so that we are ready to start setting up the training pipeline. We will make use of 🤗's [Trainer](https://huggingface.co/transformers/master/main_classes/trainer.html?highlight=trainer) for which we essentially need to do the following:

- Define a data collator. In contrast to most NLP models, XLSR-Wav2Vec2 has a much larger input length than output length. *E.g.*, a sample of input length 50000 has an output length of no more than 100. Given the large input sizes, it is much more efficient to pad the training batches dynamically, meaning that all training samples should only be padded to the longest sample in their batch and not the overall longest sample. Therefore, fine-tuning XLSR-Wav2Vec2 requires a special padding data collator, which we will define below.

- Evaluation metric. During training, the model should be evaluated on accuracy (or on the mean squared error for a regression task). We should define a `compute_metrics` function accordingly.

- Load a pretrained checkpoint. We need to load a pretrained checkpoint and configure it correctly for training.

- Define the training configuration.

After having fine-tuned the model, we will evaluate it on the test data and verify that it has indeed learned to recognize the emotion in speech.

### Set-up Trainer

Let's start by defining the data collator. The code for the data collator was copied from [this example](https://github.com/huggingface/transformers/blob/9a06b6b11bdfc42eea08fa91d0c737d1863c99e3/examples/research_projects/wav2vec2/run_asr.py#L81).

Without going into too many details, in contrast to the common data collators, this data collator treats the `input_values` and `labels` differently and thus applies separate padding functions to them (making use of XLSR-Wav2Vec2's context manager). This is necessary because in speech, input and output are of different modalities, meaning that they should not be treated by the same padding function.
Analogous to the common data collators, the padding tokens in the labels are replaced with `-100` so that those tokens are **not** taken into account when computing the loss.

In [16]:
import network_models.w2v_emotion_model.trainer as trainerUtils
data_collator = trainerUtils.DataCollatorCTCWithPadding(processor=processor, padding=True)
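
The collator itself lives in a local module that is not shown here. The sketch below illustrates only the idea of dynamic padding with NumPy: each batch is padded to its own longest clip. The function name `pad_batch` is hypothetical, and the sketch assumes single-label classification, where every clip carries one integer label and only the inputs need padding.

```python
import numpy as np

def pad_batch(input_values, labels, pad_value=0.0):
    """Pad variable-length 1-D inputs to the longest one in this batch."""
    max_len = max(len(x) for x in input_values)
    padded = np.full((len(input_values), max_len), pad_value, dtype=np.float32)
    attention_mask = np.zeros((len(input_values), max_len), dtype=np.int64)
    for row, x in enumerate(input_values):
        padded[row, : len(x)] = x
        attention_mask[row, : len(x)] = 1      # 1 = real sample, 0 = padding
    return {
        "input_values": padded,
        "attention_mask": attention_mask,
        "labels": np.asarray(labels, dtype=np.int64),  # one class id per clip
    }

batch = pad_batch([np.ones(5), np.ones(3)], labels=[2, 4])
print(batch["input_values"].shape)   # (2, 5)
print(batch["attention_mask"][1])    # [1 1 1 0 0]
```

Because the padding length is decided per batch, a batch of short clips stays short instead of being stretched to the longest clip in the whole dataset.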


Next, the evaluation metric is defined. There are many pre-defined metrics for classification and regression problems, but in this case, we will use just **Accuracy** for classification and **MSE** for regression. You can define other metrics on your own.

In [17]:
is_regression = False

In [18]:
import numpy as np
from transformers import EvalPrediction


def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.squeeze(preds) if is_regression else np.argmax(preds, axis=1)

    if is_regression:
        return {"mse": ((preds - p.label_ids) ** 2).mean().item()}
    else:
        return {"accuracy": (preds == p.label_ids).astype(np.float32).mean().item()}
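
As a quick sanity check of the classification branch, the snippet below applies the same arg-max-and-compare logic to hand-made logits. It is a standalone sketch with an illustrative function name, not a replacement for `compute_metrics`.

```python
import numpy as np

def accuracy_from_logits(logits: np.ndarray, label_ids: np.ndarray) -> float:
    """Fraction of rows whose arg-max class equals the reference label."""
    preds = np.argmax(logits, axis=1)
    return float((preds == label_ids).mean())

logits = np.array([
    [0.1, 2.0, 0.3],   # predicts class 1
    [1.5, 0.2, 0.1],   # predicts class 0
    [0.0, 0.1, 0.9],   # predicts class 2
    [0.7, 0.6, 0.5],   # predicts class 0
])
label_ids = np.array([1, 0, 2, 1])   # the last row is misclassified
print(accuracy_from_logits(logits, label_ids))  # 0.75
```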

Now, we can load the pretrained XLSR-Wav2Vec2 checkpoint into our classification model with a pooling strategy.

In [19]:
model = KNNModel.Wav2Vec2ForSpeechClassification.from_pretrained(
    model_name_or_path,
    config=config,
)

Some weights of the model checkpoint at jonatasgrosman/wav2vec2-large-xlsr-53-german were not used when initializing Wav2Vec2ForSpeechClassification: ['lm_head.bias', 'lm_head.weight']
- This IS expected if you are initializing Wav2Vec2ForSpeechClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForSpeechClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForSpeechClassification were not initialized from the model checkpoint at jonatasgrosman/wav2vec2-large-xlsr-53-german and are newly initialized: ['classifier.out_proj.weight', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.dense.bias']
You should probably TRAIN this model 

The first component of XLSR-Wav2Vec2 consists of a stack of CNN layers that are used to extract acoustically meaningful, but contextually independent, features from the raw speech signal. This part of the model has already been sufficiently trained during pretraining and, as stated in the [paper](https://arxiv.org/pdf/2006.13979.pdf), does not need to be fine-tuned anymore.
Thus, we can set `requires_grad` to `False` for all parameters of the *feature extraction* part.

In [20]:
model.freeze_feature_extractor()

In a final step, we define all parameters related to training. 
To give more explanation on some of the parameters:
- `learning_rate` and `weight_decay` were heuristically tuned until fine-tuning became stable (the cell below sets only `learning_rate` and leaves `weight_decay` at its default). Note that these parameters strongly depend on the dataset and might be suboptimal for other speech datasets.

For more explanations on other parameters, one can take a look at the [docs](https://huggingface.co/transformers/master/main_classes/trainer.html?highlight=trainer#trainingarguments).

**Note**: If you want to save the trained models in your Google Drive, the commented-out `output_dir` can be used instead.

For future use, we can turn this into a training script; we keep it simple here, and you can add more on your own.

Now, all instances can be passed to Trainer and we are ready to start training!

In [21]:
# from google.colab import drive

# drive.mount('/gdrive')

In [22]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="content/models",
    # output_dir="/content/gdrive/MyDrive/wav2vec2-xlsr-greek-speech-emotion-recognition"
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=10.0,
    fp16=True,
    save_steps=10,
    eval_steps=10,
    logging_steps=10,
    learning_rate=1e-4,
    save_total_limit=2,
)
trainer = trainerUtils.CTCTrainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=processor.feature_extractor,
)

Using cuda_amp half precision backend
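
To get a feeling for what these arguments imply, the arithmetic below estimates the number of optimizer steps. It is a back-of-the-envelope sketch that assumes a single GPU and the 614 training rows from the split above; the exact step count reported by the Trainer may differ slightly.

```python
import math

n_train = 614                      # rows in train_df
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
num_train_epochs = 10
eval_steps = 10

# Assumption: one GPU, so the effective batch is batch size x accumulation steps.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(n_train / effective_batch)
total_steps = steps_per_epoch * num_train_epochs

print(effective_batch)             # 8
print(steps_per_epoch)             # 77
print(total_steps)                 # 770
print(total_steps // eval_steps)   # 77 evaluation rounds
```

With `save_steps=10` a checkpoint is written at the same cadence as evaluation, and `save_total_limit=2` keeps only the two most recent ones on disk.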


### Training

Training will take between 10 and 60 minutes depending on the GPU allocated to this notebook. 

In case you want to use this Google Colab to fine-tune your model, you should make sure that your training doesn't stop due to inactivity. A simple hack to prevent this is to paste the following code into the console of this tab (right mouse click -> inspect -> Console tab and insert code).

```javascript
function ConnectButton(){
    console.log("Connect pushed"); 
    document.querySelector("#top-toolbar > colab-connect-button").shadowRoot.querySelector("#connect").click() 
}
setInterval(ConnectButton,60000);
```

In [23]:
torch.cuda.empty_cache()

#trainer.train()

In [24]:
import torch
#torch.cuda.memory_allocated()
torch.cuda.empty_cache()

In [25]:
trainer.save_model(output_dir='content/model')

Saving model checkpoint to content/model
Configuration saved in content/model/config.json
Model weights saved in content/model/pytorch_model.bin
Feature extractor saved in content/model/preprocessor_config.json


During training, the loss should go down and the accuracy on the test set should improve. Because this notebook is just for demonstration purposes, we can stop here.

The resulting model of this notebook has been saved to [m3hrdadfi/wav2vec2-xlsr-greek-speech-emotion-recognition](https://huggingface.co/m3hrdadfi/wav2vec2-xlsr-greek-speech-emotion-recognition)

As a final check, let's load the model and verify that it indeed has learned to recognize the emotion in the speech.

Let's first load the pretrained checkpoint.

## Evaluation

In [26]:
import librosa
from sklearn.metrics import classification_report

In [27]:
test_dataset = load_dataset("csv", data_files={"validation": "content/data/test.csv"}, delimiter="\t")["validation"]
test_dataset


Dataset({
    features: ['path', 'emotion'],
    num_rows: 154
})

In [28]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")

Device: cuda


In [29]:
model_name_or_path = "content/model"
#config = AutoConfig.from_pretrained(model_name_or_path)
#processor = Wav2Vec2Processor.from_pretrained(model_name_or_path)
model = KNNModel.Wav2Vec2ForSpeechClassification.from_pretrained(model_name_or_path).to(device)

loading configuration file content/model/config.json
Model config Wav2Vec2Config {
  "_name_or_path": "jonatasgrosman/wav2vec2-large-xlsr-53-german",
  "activation_dropout": 0.05,
  "adapter_kernel_size": 3,
  "adapter_stride": 2,
  "add_adapter": false,
  "apply_spec_augment": true,
  "architectures": [
    "Wav2Vec2ForSpeechClassification"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 1,
  "classifier_proj_size": 256,
  "codevector_dim": 768,
  "contrastive_logits_temperature": 0.1,
  "conv_bias": true,
  "conv_dim": [
    512,
    512,
    512,
    512,
    512,
    512,
    512
  ],
  "conv_kernel": [
    10,
    3,
    3,
    3,
    3,
    2,
    2
  ],
  "conv_stride": [
    5,
    2,
    2,
    2,
    2,
    2,
    2
  ],
  "ctc_loss_reduction": "mean",
  "ctc_zero_infinity": true,
  "diversity_loss_weight": 0.1,
  "do_stable_layer_norm": true,
  "eos_token_id": 2,
  "feat_extract_activation": "gelu",
  "feat_extract_dropout": 0.0,
  "feat_extract_norm": "layer",
  "feat_pr

In [30]:
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    speech_array = speech_array.squeeze().numpy()
    speech_array = librosa.resample(np.asarray(speech_array), orig_sr=sampling_rate, target_sr=processor.feature_extractor.sampling_rate)

    batch["speech"] = speech_array
    return batch


def predict(batch):
    features = processor(batch["speech"], sampling_rate=processor.feature_extractor.sampling_rate, return_tensors="pt", padding=True)

    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)

    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits 

    pred_ids = torch.argmax(logits, dim=-1).detach().cpu().numpy()
    batch["predicted"] = pred_ids
    return batch

In [31]:
test_dataset = test_dataset.map(speech_file_to_array_fn)



In [32]:
result = test_dataset.map(predict, batched=True, batch_size=8)


In [33]:
label_names = [config.id2label[i] for i in range(config.num_labels)]
label_names

['angry', 'disgust', 'fear', 'happy', 'ps', 'sad']

In [34]:
result[0]

{'path': '../Stimuli_Intensitätsmorphs/nm04_sad_w02_c_50_70dB.wav',
 'emotion': 'sad',
 'speech': [1.4362941946899355e-08,
  -2.2597133053636753e-08,
  3.1742164452452926e-08,
  -4.121974583881638e-08,
  5.0132950235592943e-08,
  -5.7397191000063685e-08,
  6.135852004263143e-08,
  -6.041211264573576e-08,
  5.2473389189344743e-08,
  -3.5618906935042105e-08,
  7.522541700666352e-09,
  3.3769627094670795e-08,
  -9.030095782236458e-08,
  1.6409683212259552e-07,
  -2.584087610557617e-07,
  3.789788536323613e-07,
  -5.408544438978424e-07,
  7.900629839241446e-07,
  -1.3239421150501585e-06,
  4.075676315551391e-06,
  2.991001383634284e-05,
  3.035163172171451e-05,
  3.5935054256697185e-06,
  -9.816906185733387e-07,
  6.11801908689813e-07,
  -5.792601314169588e-07,
  7.009504656707577e-07,
  -9.691298146208283e-07,
  1.4927182974133757e-06,
  -2.832466407198808e-06,
  1.7455626220908016e-05,
  5.6992186728166416e-05,
  8.202600292861462e-05,
  6.328233575914055e-05,
  6.567654781974852e-05,
  

In [35]:
y_true = [config.label2id[name] for name in result["emotion"]]


y_pred = result["predicted"]

print(y_true[:5])
print(y_pred[:5])

[5, 1, 1, 4, 5]
[4, 4, 4, 4, 4]


In [36]:
print(result['predicted'])
#config.label2id['happy']

[4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 4, 4, 3, 4, 4, 4, 4, 3, 2, 3, 4, 4, 4, 4, 3, 3, 4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 4, 4, 4, 3, 3, 4, 4, 4, 4, 4, 4, 3, 4, 3, 4, 3, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 0, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4, 4, 4, 4, 4, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 0, 4, 3, 4, 3, 4, 4, 4, 4, 4, 4, 4, 4, 0, 4, 3, 4, 4, 4, 4, 4, 4, 4, 2, 3, 4, 2, 4, 0, 4, 4, 4, 3, 4, 3, 4, 4, 4, 4, 3, 4, 4]


In [37]:
print(classification_report(y_true, y_pred, target_names=label_names))

              precision    recall  f1-score   support

       angry       0.00      0.00      0.00        25
     disgust       0.00      0.00      0.00        26
        fear       1.00      0.12      0.21        26
       happy       0.19      0.19      0.19        26
          ps       0.20      0.92      0.33        26
         sad       0.00      0.00      0.00        25

    accuracy                           0.21       154
   macro avg       0.23      0.21      0.12       154
weighted avg       0.23      0.21      0.12       154
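
The per-class numbers in the report can be reproduced from raw counts. The sketch below does this for `fear`; the counts (3 correct predictions, 0 false alarms, 23 misses) are inferred from the precision, recall, and support columns rather than read from a confusion matrix, and the second call assumes a class that is never predicted at all.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Per-class precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 'fear': 26 test clips, 3 of them recognised, no false alarms (inferred counts).
p, r, f1 = precision_recall_f1(tp=3, fp=0, fn=23)
print(f"{p:.2f} {r:.2f} {f1:.2f}")   # 1.00 0.12 0.21

# A class with 25 clips that is never predicted scores zero throughout.
print(precision_recall_f1(tp=0, fp=0, fn=25))   # (0.0, 0.0, 0.0)

# Reference point: guessing among 6 balanced classes.
print(f"{1 / 6:.2f}")   # 0.17
```

An overall accuracy of 0.21 is therefore only slightly above the chance level for six balanced classes, with most predictions falling on a single class (`ps`).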





## Prediction

In [38]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio
from transformers import AutoConfig, Wav2Vec2Processor

import librosa
import IPython.display as ipd
import numpy as np
import pandas as pd

In [45]:
#device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
##model_name_or_path = "m3hrdadfi/wav2vec2-xlsr-greek-speech-emotion-recognition"
## config = AutoConfig.from_pretrained(model_name_or_path)
#processor = Wav2Vec2Processor.from_pretrained(model_name_or_path)
sampling_rate = processor.feature_extractor.sampling_rate
#model = KNNModel.Wav2Vec2ForSpeechClassification.from_pretrained(model_name_or_path).to(device)

In [46]:
def speech_file_to_array_fn(path, sampling_rate):
    speech_array, _sampling_rate = torchaudio.load(path)
    # Resample from the file's native rate to the rate the model expects.
    resampler = torchaudio.transforms.Resample(orig_freq=_sampling_rate, new_freq=sampling_rate)
    speech = resampler(speech_array).squeeze().numpy()
    return speech


def predict(path, sampling_rate):
    speech = speech_file_to_array_fn(path, sampling_rate)
    features = processor(speech, sampling_rate=sampling_rate, return_tensors="pt", padding=True)

    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)

    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits

    scores = F.softmax(logits, dim=1).detach().cpu().numpy()[0]
    outputs = [{"Emotion": config.id2label[i], "Score": f"{round(score * 100, 3):.1f}%"} for i, score in enumerate(scores)]
    return outputs


STYLES = """
<style>
div.display_data {
    margin: 0 auto;
    max-width: 500px;
}
table.xxx {
    margin: 50px !important;
    float: right !important;
    clear: both !important;
}
table.xxx td {
    min-width: 300px !important;
    text-align: center !important;
}
</style>
""".strip()

def prediction(df_row):
    path, emotion = df_row["path"], df_row["emotion"]
    df = pd.DataFrame([{"Emotion": emotion, "Sentence": "    "}])
    setup = {
        'border': 2,
        'show_dimensions': True,
        'justify': 'center',
        'classes': 'xxx',
        'escape': False,
    }
    ipd.display(ipd.HTML(STYLES + df.to_html(**setup) + "<br />"))
    speech, sr = torchaudio.load(path)
    speech = speech[0].numpy().squeeze()
    speech = librosa.resample(np.asarray(speech), orig_sr=sr, target_sr=sampling_rate)
    ipd.display(ipd.Audio(data=np.asarray(speech), autoplay=True, rate=sampling_rate))

    outputs = predict(path, sampling_rate)
    r = pd.DataFrame(outputs)
    ipd.display(ipd.HTML(STYLES + r.to_html(**setup) + "<br />"))

In [47]:
test = pd.read_csv("content/data/test.csv", sep="\t")
test.head()

|     | path | emotion |
| --- | --- | --- |
| 0 | ../Stimuli_Intensitätsmorphs/nm04_sad_w02_c_50... | sad |
| 1 | ../Stimuli_Intensitätsmorphs/nf01_dis_w01_o_50... | disgust |
| 2 | ../Stimuli_Intensitätsmorphs/nm02_dis_w02_o_25... | disgust |
| 3 | ../Stimuli_Intensitätsmorphs/nm04_sur_w05_o_75... | ps |
| 4 | ../Stimuli_Intensitätsmorphs/nf03_sad_w03_o_10... | sad |


In [50]:
prediction(test.iloc[0])

|     | Emotion | Sentence |
| --- | --- | --- |
| 0 | sad |  |




|     | Emotion | Score |
| --- | --- | --- |
| 0 | angry | 15.5% |
| 1 | disgust | 15.2% |
| 2 | fear | 15.0% |
| 3 | happy | 18.1% |
| 4 | ps | 18.7% |
| 5 | sad | 17.6% |
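
The scores in this table come from the softmax that `predict` applies to the logits. The sketch below shows the same transformation in NumPy on made-up, nearly flat logits (the values are assumptions chosen only for illustration): when the logits are close to each other, every class ends up near 1/6, which is the pattern seen above.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

labels = ["angry", "disgust", "fear", "happy", "ps", "sad"]
logits = np.array([0.02, 0.00, -0.01, 0.17, 0.20, 0.14])  # made-up, nearly flat
scores = softmax(logits)
for name, score in zip(labels, scores):
    print(f"{name:>8}: {score * 100:.1f}%")
```

A confident model would instead put most of the probability mass on one class, so near-uniform scores are a sign that the classifier has learned little about these clips.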


In [49]:
prediction(test.iloc[15])

|     | Emotion | Sentence |
| --- | --- | --- |
| 0 | angry |  |




|     | Emotion | Score |
| --- | --- | --- |
| 0 | angry | 17.1% |
| 1 | disgust | 16.6% |
| 2 | fear | 15.7% |
| 3 | happy | 16.7% |
| 4 | ps | 18.2% |
| 5 | sad | 15.7% |


In [None]:
prediction(test.iloc[2])