## Disclaimer 
This notebook is adapted from one by Mehrdad Farahani (https://huggingface.co/m3hrdadfi) and modified for our purposes.

# Emotion Recognition in Greek Speech Using Wav2Vec 2.0

**Wav2Vec 2.0** is a pretrained model for Automatic Speech Recognition (ASR), released in [September 2020](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/) by Alexei Baevski, Michael Auli, and Alex Conneau. Soon after Wav2Vec2 demonstrated superior performance on the English ASR dataset LibriSpeech, *Facebook AI* presented XLSR-Wav2Vec2 (click [here](https://arxiv.org/abs/2006.13979)). XLSR stands for *cross-lingual speech representations* and refers to XLSR-Wav2Vec2's ability to learn speech representations that are useful across multiple languages.

Similar to Wav2Vec2, XLSR-Wav2Vec2 learns powerful speech representations from hundreds of thousands of hours of unlabeled speech in more than 50 languages. Similar to [BERT's masked language modeling](http://jalammar.github.io/illustrated-bert/), the model learns contextualized speech representations by randomly masking feature vectors before passing them to a transformer network.

![wav2vec2_structure](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/xlsr_wav2vec2.png)
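
To make the masking idea concrete, here is a minimal plain-Python sketch of span masking over a sequence of feature vectors. It is an illustration only, not the model's actual implementation: the function names `sample_mask` and `apply_mask`, the toy 2-dimensional frames, and the all-zero mask vector are assumptions made for this example. The default start probability (0.065) and span length (10) mirror the values reported in the Wav2Vec 2.0 paper, and each frame is drawn independently as a span start, which is a simplification.

```python
import random


def sample_mask(num_frames, start_prob=0.065, span_length=10, seed=0):
    """Return a list of booleans; True marks a masked frame.

    Each frame becomes a span start with probability `start_prob`, and the
    `span_length` frames beginning at each start are masked. Spans may overlap.
    """
    rng = random.Random(seed)
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < start_prob:
            for t in range(start, min(start + span_length, num_frames)):
                mask[t] = True
    return mask


def apply_mask(frames, mask, mask_vector):
    """Replace every masked frame with the shared mask vector."""
    return [mask_vector if is_masked else frame for frame, is_masked in zip(frames, mask)]


# 50 toy feature vectors of size 2 (hypothetical data)
frames = [[float(t), float(t) * 0.5] for t in range(50)]
mask = sample_mask(len(frames))
masked = apply_mask(frames, mask, mask_vector=[0.0, 0.0])
print(f"masked {sum(mask)} of {len(mask)} frames")
```

During pretraining the transformer sees the masked sequence and has to identify the true content of the masked positions, which is what forces the representations to be contextual.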

The authors show for the first time that massively pretraining an ASR model on cross-lingual unlabeled speech data, followed by language-specific fine-tuning on very little labeled data, achieves state-of-the-art results. See Tables 1-5 of the official [paper](https://arxiv.org/pdf/2006.13979.pdf).

During the fine-tuning week hosted by HuggingFace, more than 300 people participated in fine-tuning the pretrained XLSR-Wav2Vec2 on low-resource ASR datasets for more than 50 languages. The model is fine-tuned using [Connectionist Temporal Classification](https://distill.pub/2017/ctc/) (CTC), an algorithm used to train neural networks for sequence-to-sequence problems, mainly in Automatic Speech Recognition and handwriting recognition. See this [notebook](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Fine_Tune_XLSR_Wav2Vec2_on_Turkish_ASR_with_%F0%9F%A4%97_Transformers.ipynb#scrollTo=Gx9OdDYrCtQ1) for more information about XLSR-Wav2Vec2 fine-tuning.
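
As a rough illustration of how CTC output is read, the sketch below applies the usual greedy collapsing rule: merge consecutive repeated labels, then drop the blank symbol. The blank symbol `_` and the function name `ctc_collapse` are assumptions for this example; it is not the decoding code used by any particular library.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol used only in this sketch


def ctc_collapse(frame_labels, blank=BLANK):
    """Merge consecutive repeats, then remove blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


# One label per audio frame; the blank separates the two genuine "l" characters.
print(ctc_collapse(list("hh_e_ll_ll_oo")))  # hello
```

Note that the classifier built later in this notebook does not use CTC itself: the CTC head of the ASR checkpoint is discarded and replaced by a classification head.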

This model has shown significant results in many low-resource languages. You can see the [competition board](https://paperswithcode.com/dataset/common-voice) or test the models from the [HuggingFace hub](https://huggingface.co/models?filter=xlsr-fine-tuning-week).


In this notebook, we will go through how to use this model to recognize the emotional aspects of speech in a language (or, more generally, how to use it for any speech classification problem). Before going any further, we need to install some handy packages and define some environment variables.

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [2]:
!unzip ./drive/MyDrive/all_data.zip

Archive:  ./drive/MyDrive/all_data.zip
   creating: audio_files/
   creating: trimmed/
  inflating: standart_data.csv       
  inflating: data.csv                
 extracting: audio_files.zip         
 extracting: trimmed.zip             


In [3]:
!rmdir trimmed audio_files

In [4]:
!unzip  trimmed.zip 

Archive:  trimmed.zip
   creating: trimmed/
  inflating: trimmed/Sib_13-f_85.29_87.62.wav  
  inflating: trimmed/Sib_01-f_133.73_134.91.wav  
  inflating: trimmed/Sib_07-f_22.85_23.7.wav  
  inflating: trimmed/Sib_08-f_66.97_68.83.wav  
  inflating: trimmed/Sib_12-f_61.24_62.46.wav  
  inflating: trimmed/Sib_03-m_44.69_46.76.wav  
  inflating: trimmed/Sib_17-m_277.76_285.24.wav  
  inflating: trimmed/Sib_09-m_62.59_64.41.wav  
  inflating: trimmed/Sib_07-f_68.17_70.5.wav  
  inflating: trimmed/Sib_14-f_55.94_57.75.wav  
  inflating: trimmed/Sib_17-m_80.66_82.32.wav  
  inflating: trimmed/Sib_11-m_90.12_91.39.wav  
  inflating: trimmed/Sib_15-f_103.38_104.83.wav  
  inflating: trimmed/Sib_11-m_107.91_108.7.wav  
  inflating: trimmed/Sib_09-m_48.23_53.44.wav  
  inflating: trimmed/Sib_12-f_74.32_74.74.wav  
  inflating: trimmed/Sib_09-m_91.38_96.22.wav  
  inflating: trimmed/Sib_07-f_35.59_37.29.wav  
  inflating: trimmed/Sib_11-m_125.36_126.29.wav  
  inflating: trimmed/Sib_13-f_65.379_

In [5]:
!unzip  audio_files.zip

Streaming output truncated to the last 5000 lines.
  inflating: audio_files/20180619_vav1949-193224-197611.wav  
  inflating: audio_files/04072022_TNG1957_Melikhovskaya-7506527-7507734.wav  
  inflating: audio_files/04072022_MLI1941_Melikhovskaya-1868330-1869700.wav  
  inflating: audio_files/20180618_zii1932_a-417285-418818.wav  
  inflating: audio_files/20180618_zii1932_a-978202-980282.wav  
  inflating: audio_files/020721_VIK1941_Razdorskaya-4945966-4953449.wav  
  inflating: audio_files/20180618_enm1930-154743-156780.wav  
  inflating: audio_files/20180618_zii1932_a-3876844-3877547.wav  
  inflating: audio_files/20180619_vav1949-15993-17537.wav  
  inflating: audio_files/20180618_zii1932_a-274612-278412.wav  
  inflating: audio_files/04072022_TNG1957_Melikhovskaya-4046632-4047586.wav  
  inflating: audio_files/04072022_TNG1957_Melikhovskaya-1435930-1437110.wav  
  inflating: audio_files/Keba_MAS1916_2-2591059-2593141.wav  
  inflating: audio_files/020

In [6]:
%%capture

!pip install git+https://github.com/huggingface/datasets.git
!pip install git+https://github.com/huggingface/transformers.git
!pip install jiwer
!pip install torchaudio
!pip install librosa
!pip install --upgrade accelerate

# Monitor the training process
# !pip install wandb

In [7]:
%env LC_ALL=C.UTF-8
%env LANG=C.UTF-8
%env TRANSFORMERS_CACHE=/content/cache
%env HF_DATASETS_CACHE=/content/cache
%env CUDA_LAUNCH_BLOCKING=1

env: LC_ALL=C.UTF-8
env: LANG=C.UTF-8
env: TRANSFORMERS_CACHE=/content/cache
env: HF_DATASETS_CACHE=/content/cache
env: CUDA_LAUNCH_BLOCKING=1


In [8]:
import numpy as np
import pandas as pd

from pathlib import Path
from tqdm import tqdm

import torchaudio
from sklearn.model_selection import train_test_split

import os
import sys

In [9]:
df = pd.read_csv('/content/data.csv', index_col=0)
df_standart = pd.read_csv('standart_data.csv', index_col=0)
df_int = df[df['informant'] == 'Interviewer']
#df_last = pd.concat([df_standart, df_int])
# df_last = df_last.sample(frac=1).head(1207)
df_opochka = df[df['corpus'] == 'opochka'].sample(frac=1).head(1207)
df_don = df[df['corpus'] == 'don_rnd'].sample(frac=1).head(1207)
df_keba = df[df['corpus'] == 'keba'].sample(frac=1).head(1207)
df = pd.concat([df_int, df_opochka, df_keba, df_don]).sample(frac=1)

In [None]:
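def age_groups(age):
    """Map an age in years to a coarse age-group label."""
    if age < 25:
        return 'young'
    elif 40 < age < 46:
        return 'middle'
    else:
        # Note: every other age, including 25-40 and 46+, falls through to 'elderly'
        return 'elderly'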

In [10]:
# df['age_group'] = df['age'].apply(age_groups)
# df = df.sample(frac=1)
df

Unnamed: 0,informant,start,end,filename,corpus,text,gender,age
Keba_KMV1919-609347-613961,KMV1919,609.347,613.961,audio_files/Keba_KMV1919-609347-613961.wav,keba,"было молоко, так ульёшь молочко-то, серое-то, ...",,
020721_VIK1941_Razdorskaya-3631716-3636432,VIK1941,3631.716,3636.432,audio_files/020721_VIK1941_Razdorskaya-3631716...,don_rnd,"Пошёл зарабатывать, живёт с девочкой полтора г...",,
20180618_zii1932_a-417285-418818,Interviewer,417.285,418.818,audio_files/20180618_zii1932_a-417285-418818.wav,standart,Они где-то рядом живут здесь?,,
04072022_TNG1957_Melikhovskaya-6126273-6127822,TNG1957,6126.273,6127.822,audio_files/04072022_TNG1957_Melikhovskaya-612...,don_rnd,Туда доеду на троллейбусе.,,
Keba_MAS1916_2-2136785-2144243,Interviewer,2136.785,2144.243,audio_files/Keba_MAS1916_2-2136785-2144243.wav,standart,"Как, Вы говорите, что стало-то, вот она, там э...",,
...,...,...,...,...,...,...,...,...
Keba_MAS1916_2-2792847-2795001,Interviewer,2792.847,2795.001,audio_files/Keba_MAS1916_2-2792847-2795001.wav,standart,"Ну, уж устали, да, замучили Вас?",,
Keba_MAS1916_2-1302173-1308036,MAS1916,1302.173,1308.036,audio_files/Keba_MAS1916_2-1302173-1308036.wav,keba,"он пришёл, стал звать опять, а звали Фёклой, м...",,
020721_VIK1941_Razdorskaya-6074599-6077920,VIK1941,6074.599,6077.920,audio_files/020721_VIK1941_Razdorskaya-6074599...,don_rnd,а на четвёртый уже могут разбежаться.,,
Keba_KMV1919-406846-416230,KMV1919,406.846,416.230,audio_files/Keba_KMV1919-406846-416230.wav,keba,"А я боле скотный бросила, да ушла с има, да во...",,


In [11]:
# Filter out broken and non-existent paths

print(f"Step 0: {len(df)}")

df["status"] = df["filename"].apply(lambda path: True if os.path.exists(path) else None)
df = df.dropna(subset=["status"])  # drop rows whose audio file is missing
df = df.drop(columns="status")
print(f"Step 1: {len(df)}")

df = df.sample(frac=1)
df = df.reset_index(drop=True)
df = df.head(3000)
eval_csv = df.tail(1028)
eval_csv.to_csv('eval.csv')
df

Step 0: 5179
Step 1: 5179




Unnamed: 0,informant,start,end,filename,corpus,text,gender,age
0,KMV1919,3312.870,3322.183,audio_files/Keba_KMV1919-3312870-3322183.wav,keba,Дак я на его вот паром дак вот так близко подо...,,
1,Interviewer,974.003,976.027,audio_files/20180619_vav1949-974003-976027.wav,standart,Как это называется место?,,
2,MAS1916,1669.715,1672.148,audio_files/Keba_MAS1916_2-1669715-1672148.wav,keba,"Но, но. Не пожалела рыбу-то?",,
3,вав1949,326.473,328.697,audio_files/20180619_vav1949-326473-328697.wav,opochka,"Может я путаю, может это не эти люди?",,
4,Interviewer,954.258,954.770,audio_files/20180618_zii1932_a-954258-954770.wav,standart,Здорово.,,
...,...,...,...,...,...,...,...,...
2995,VIK1941,289.918,292.298,audio_files/020721_VIK1941_Razdorskaya-289918-...,don_rnd,Ну как вам сказать?,,
2996,MAS1916,2311.682,2316.812,audio_files/Keba_MAS1916_2-2311682-2316812.wav,keba,"Но, но, ремонтируют сейчас они, ремонтируют, ц...",,
2997,KMV1919,571.241,575.563,audio_files/Keba_KMV1919-571241-575563.wav,keba,"а это - житники, половина той муки, половина д...",,
2998,KMV1919,303.410,314.689,audio_files/Keba_KMV1919-303410-314689.wav,keba,"А нас встретили, и на пятнадцатый этаж нас соб...",,


Let's display a random sample of the dataset and run it a couple of times to get a feeling for the audio and its label.

For training, we need to split the data into train and test sets; in this example, we hold out `20%` of the data for the test set.

In [14]:
save_path = "./"

train_df, test_df = train_test_split(df, test_size=0.2, random_state=101, stratify=df["corpus"])

train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

train_df.to_csv(f"{save_path}/train.csv", sep="\t", encoding="utf-8", index=False)
test_df.to_csv(f"{save_path}/test.csv", sep="\t", encoding="utf-8", index=False)


print(train_df.shape)
print(test_df.shape)

(2400, 8)
(600, 8)
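
The `stratify` argument keeps the class proportions the same in both splits. The following plain-Python sketch shows the idea on toy labels. It is a simplified stand-in, not scikit-learn's algorithm, and the function name `stratified_split` and the toy label counts are assumptions for this example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=101):
    """Return (train_indices, test_indices), holding out `test_size` of each label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train, test = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test


# Toy, imbalanced label column (hypothetical counts)
labels = ["keba"] * 50 + ["opochka"] * 30 + ["don_rnd"] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Without stratification, a small class could by chance end up underrepresented in the test set, which would make the per-class metrics less reliable.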


## Prepare Data for Training

In [15]:
# Load the CSV files created above with the datasets library
from datasets import load_dataset


data_files = {
    "train": "./train.csv", 
    "validation": "./test.csv",
}

dataset = load_dataset("csv", data_files=data_files, delimiter="\t", )
train_dataset = dataset["train"]
eval_dataset = dataset["validation"]

print(train_dataset)
print(eval_dataset)

Downloading and preparing dataset csv/default to /content/cache/csv/default-74286ef6f9f08fcb/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /content/cache/csv/default-74286ef6f9f08fcb/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

Dataset({
    features: ['informant', 'start', 'end', 'filename', 'corpus', 'text', 'gender', 'age'],
    num_rows: 2400
})
Dataset({
    features: ['informant', 'start', 'end', 'filename', 'corpus', 'text', 'gender', 'age'],
    num_rows: 600
})


In [16]:
# We need to specify the input and output column
input_column = "filename"
output_column = "corpus"

In [17]:
# we need to distinguish the unique labels in our SER dataset
label_list = train_dataset.unique(output_column)
label_list.sort()  # Let's sort it for determinism
num_labels = len(label_list)
print(f"A classification problem with {num_labels} classes: {label_list}")

A classification problem with 4 classes: ['don_rnd', 'keba', 'opochka', 'standart']
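
A small sketch of why the label list is sorted: the integer id of each class then depends only on the set of labels, not on the order in which rows happen to appear. The `seen_order` list is made-up example data; the resulting mappings have the same shape as the `label2id` and `id2label` dictionaries passed to the model config.

```python
# Labels in the arbitrary order they might be encountered in the data (hypothetical)
seen_order = ["keba", "standart", "don_rnd", "opochka", "keba"]

label_list = sorted(set(seen_order))
label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for i, label in enumerate(label_list)}

print(label_list)
print(label2id)
```

Keeping both directions around is useful later: `label2id` turns dataset labels into training targets, and `id2label` turns predicted class indices back into readable names.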


In [18]:
from transformers import AutoConfig, Wav2Vec2Processor

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

In [19]:
model_name_or_path = "jonatasgrosman/wav2vec2-large-xlsr-53-russian"
pooling_mode = "mean"

In [20]:
# config
config = AutoConfig.from_pretrained(
    model_name_or_path,
    num_labels=num_labels,
    label2id={label: i for i, label in enumerate(label_list)},
    id2label={i: label for i, label in enumerate(label_list)},
    finetuning_task="wav2vec2_clf",
)
setattr(config, 'pooling_mode', pooling_mode)

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.78k [00:00<?, ?B/s]

In [21]:
processor = Wav2Vec2Processor.from_pretrained(model_name_or_path, padding='max_length')
target_sampling_rate = processor.feature_extractor.sampling_rate
print(f"The target sampling rate: {target_sampling_rate}")

Downloading (…)rocessor_config.json:   0%|          | 0.00/262 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/387 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

The target sampling rate: 16000


## Preprocess Data

In [22]:
from transformers import Wav2Vec2FeatureExtractor
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name_or_path, device='cuda')


In [26]:
import librosa


def speech_file_to_array_fn(path):
    # librosa loads the file as mono float32 and resamples it to the target rate
    speech_array, sampling_rate = librosa.load(path, sr=target_sampling_rate)
    speech = speech_array.squeeze()
    return speech

def label_to_id(label, label_list):

    if len(label_list) > 0:
        return label_list.index(label) if label in label_list else -1

    return label

def preprocess_function(examples):
    speech_list = [speech_file_to_array_fn(path) for path in examples[input_column]]
    target_list = [label_to_id(label, label_list) for label in examples[output_column]]

    result = processor(speech_list, sampling_rate=target_sampling_rate)
    result["labels"] = list(target_list)

    return result

In [24]:
import librosa

In [27]:
train_dataset = train_dataset.map(
    preprocess_function,
    batch_size=100,
    batched=True,
    num_proc=4)

eval_dataset = eval_dataset.map(
    preprocess_function,
    batch_size=100,
    batched=True,
    num_proc=4
)

Map (num_proc=4):   0%|          | 0/2400 [00:00<?, ? examples/s]



Map (num_proc=4):   0%|          | 0/600 [00:00<?, ? examples/s]



In [28]:
idx = 0
print(f"Training input_values: {train_dataset[idx]['input_values']}")
print(f"Training attention_mask: {train_dataset[idx]['attention_mask']}")
print(f"Training labels: {train_dataset[idx]['labels']} - {train_dataset[idx]['corpus']}")

Training input_values: [0.00141008326318115, -0.011075304821133614, 0.007313938345760107, -0.00562569173052907, -0.003314872505143285, -0.018251139670610428, -0.0061629945412278175, 0.005546377971768379, 0.014842886477708817, 0.011478825472295284, -0.008735816925764084, 0.008661319501698017, -0.0023054007906466722, -0.007740320172160864, -0.00927894189953804, -0.030467253178358078, -0.016865499317646027, -0.024820229038596153, -0.0040038269944489, -0.015043574385344982, -0.023456649854779243, -0.012607353739440441, -0.02571595087647438, -0.01554191019386053, -0.018696481361985207, -0.004526862408965826, -0.005064778961241245, 0.0036395536735653877, 0.019346699118614197, 0.017134355381131172, 0.017985785380005836, 0.025271188467741013, 0.007814222015440464, 0.016204742714762688, 0.010166766121983528, 0.02491781860589981, 0.0158368032425642, 0.010094750672578812, 0.0016645491123199463, 0.008911551907658577, 0.020025383681058884, 0.016766851767897606, 0.019674431532621384, -0.006394026800

## Model

In [29]:
from dataclasses import dataclass
from typing import Optional, Tuple
import torch
from transformers.utils import ModelOutput


@dataclass
class SpeechClassifierOutput(ModelOutput):
    loss: Optional[torch.FloatTensor] = None
    logits: torch.FloatTensor = None
    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    attentions: Optional[Tuple[torch.FloatTensor]] = None


In [30]:
import torch
import torch.nn as nn
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss

from transformers.models.wav2vec2.modeling_wav2vec2 import (
    Wav2Vec2PreTrainedModel,
    Wav2Vec2Model
)


class Wav2Vec2ClassificationHead(nn.Module):
    """Head for wav2vec classification task."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.final_dropout)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features, **kwargs):
        x = features
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x


class Wav2Vec2ForSpeechClassification(Wav2Vec2PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.pooling_mode = config.pooling_mode
        self.config = config

        self.wav2vec2 = Wav2Vec2Model(config)
        self.classifier = Wav2Vec2ClassificationHead(config)

        self.init_weights()

    def freeze_feature_extractor(self):
        self.wav2vec2.feature_extractor._freeze_parameters()

    def merged_strategy(
            self,
            hidden_states,
            mode="mean"
    ):
        if mode == "mean":
            outputs = torch.mean(hidden_states, dim=1)
        elif mode == "sum":
            outputs = torch.sum(hidden_states, dim=1)
        elif mode == "max":
            outputs = torch.max(hidden_states, dim=1)[0]
        else:
            raise Exception(
                "The pooling method hasn't been defined! Your pooling mode must be one of these ['mean', 'sum', 'max']")

        return outputs

    def forward(
            self,
            input_values,
            attention_mask=None,
            output_attentions=None,
            output_hidden_states=None,
            return_dict=None,
            labels=None,
    ):
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        outputs = self.wav2vec2(
            input_values,
            attention_mask=attention_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        hidden_states = outputs[0]
        hidden_states = self.merged_strategy(hidden_states, mode=self.pooling_mode)
        logits = self.classifier(hidden_states)

        loss = None
        if labels is not None:
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                loss_fct = MSELoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels)
            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = BCEWithLogitsLoss()
                loss = loss_fct(logits, labels)

        if not return_dict:
            output = (logits,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return SpeechClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )
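
The three pooling modes in `merged_strategy` all reduce the time axis of the encoder output to a single vector per utterance. The sketch below reproduces that reduction on a tiny nested list for one utterance. It is a plain-Python stand-in for the `torch.mean`, `torch.sum`, and `torch.max` calls, and the function name `pool` and the toy values are assumptions for this example.

```python
def pool(hidden_states, mode="mean"):
    """Pool a (time, hidden) nested list over the time axis."""
    columns = list(zip(*hidden_states))  # one tuple per hidden dimension
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "sum":
        return [sum(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"Unknown pooling mode: {mode!r}")


# Three time steps, hidden size 2 (toy values)
hidden_states = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
print(pool(hidden_states, "mean"))  # [2.0, 2.0]
print(pool(hidden_states, "sum"))   # [6.0, 6.0]
print(pool(hidden_states, "max"))   # [3.0, 4.0]
```

One thing to keep in mind: the pooling in the model above runs over all time steps, including those that correspond to padding, so utterances that are much shorter than the longest one in a batch contribute padded frames to the pooled vector.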


## Training

In [31]:
from dataclasses import dataclass
from typing import Dict, List, Optional, Union
import torch

import transformers
from transformers import Wav2Vec2Processor


@dataclass
class DataCollatorCTCWithPadding:
    """
    Data collator that will dynamically pad the inputs received.
    Args:
        processor (:class:`~transformers.Wav2Vec2Processor`)
            The processor used for processing the data.
        padding (:obj:`bool`, :obj:`str` or :class:`~transformers.tokenization_utils_base.PaddingStrategy`, `optional`, defaults to :obj:`True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:
            * :obj:`True` or :obj:`'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            * :obj:`'max_length'`: Pad to a maximum length specified with the argument :obj:`max_length` or to the
              maximum acceptable input length for the model if that argument is not provided.
            * :obj:`False` or :obj:`'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of
              different lengths).
        max_length (:obj:`int`, `optional`):
            Maximum length of the ``input_values`` of the returned list and optionally padding length (see above).
        max_length_labels (:obj:`int`, `optional`):
            Maximum length of the ``labels`` returned list and optionally padding length (see above).
        pad_to_multiple_of (:obj:`int`, `optional`):
            If set will pad the sequence to a multiple of the provided value.
            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.0 (Volta).
    """

    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True
    max_length: Optional[int] = None
    max_length_labels: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    pad_to_multiple_of_labels: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [feature["labels"] for feature in features]

        d_type = torch.long if isinstance(label_features[0], int) else torch.float

        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

        batch["labels"] = torch.tensor(label_features, dtype=d_type)

        return batch
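
Conceptually, dynamic padding brings every sequence in a batch up to the length of the longest one and records which positions are real. The sketch below shows this on plain lists. It is only an illustration of the idea: `pad_batch` is a hypothetical helper, and the real collator delegates this work to `processor.pad` and returns tensors.

```python
def pad_batch(sequences, padding_value=0.0):
    """Pad to the longest sequence and build a matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_values, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_values.append(list(seq) + [padding_value] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_values": input_values, "attention_mask": attention_mask}


batch = pad_batch([[0.1, 0.2, 0.3], [0.5]])
print(batch["input_values"])    # [[0.1, 0.2, 0.3], [0.5, 0.0, 0.0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```

Padding per batch rather than to a global maximum keeps memory use proportional to the longest clip in each batch, which matters here because the clips vary widely in duration.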

In [32]:
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)

In [33]:
is_regression = False

In [34]:
import numpy as np
from transformers import EvalPrediction


def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.squeeze(preds) if is_regression else np.argmax(preds, axis=1)

    if is_regression:
        return {"mse": ((preds - p.label_ids) ** 2).mean().item()}
    else:
        return {"accuracy": (preds == p.label_ids).astype(np.float32).mean().item()}
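
For the classification case, `compute_metrics` takes the arg-max over the class dimension and compares it with the reference labels. A minimal plain-Python equivalent on toy logits looks like this (the helper names and the numbers are assumptions made for the example):

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, label_ids):
    """Fraction of rows whose highest-scoring class matches the label."""
    preds = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, label_ids))
    return correct / len(label_ids)


# Three examples, four classes (toy logits)
logits = [
    [2.0, 0.1, -1.0, 0.3],  # predicts class 0
    [0.2, 0.1, 3.0, 0.0],   # predicts class 2
    [0.0, 1.5, 0.2, 0.1],   # predicts class 1
]
label_ids = [0, 2, 3]
print(accuracy(logits, label_ids))  # two of three correct
```

Accuracy alone can hide weak classes when the label distribution is uneven, so a per-class report is a useful complement at evaluation time.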

In [35]:
model = Wav2Vec2ForSpeechClassification.from_pretrained(
    model_name_or_path,
    config=config,
)

Downloading pytorch_model.bin:   0%|          | 0.00/1.26G [00:00<?, ?B/s]

Some weights of the model checkpoint at jonatasgrosman/wav2vec2-large-xlsr-53-russian were not used when initializing Wav2Vec2ForSpeechClassification: ['lm_head.weight', 'lm_head.bias']
- This IS expected if you are initializing Wav2Vec2ForSpeechClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForSpeechClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForSpeechClassification were not initialized from the model checkpoint at jonatasgrosman/wav2vec2-large-xlsr-53-russian and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this mode

In [36]:
model.freeze_feature_extractor()

In [37]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/content/wav2vec2-xlsr-russian-gender-recognition",
    # output_dir="/content/gdrive/MyDrive/wav2vec2-xlsr-greek-speech-emotion-recognition"
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=1.0,
    fp16=False,
    save_steps=10,
    eval_steps=10,
    logging_steps=10,
    learning_rate=1e-4,
    save_total_limit=2,
)
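
A quick back-of-the-envelope calculation shows what these settings imply for the run. The sketch assumes a single device and the 2400 training examples reported earlier; the step counting is an approximation of how the trainer schedules updates, not a call into the library.

```python
import math

num_train_examples = 2400        # size of the train split reported earlier
per_device_batch_size = 2
gradient_accumulation_steps = 2
num_devices = 1                  # assumption: one Colab GPU
num_train_epochs = 1.0
eval_steps = 10

# Number of examples that contribute to each optimizer update
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices

# Forward/backward passes per epoch, then optimizer updates per epoch
batches_per_epoch = math.ceil(num_train_examples / (per_device_batch_size * num_devices))
optimizer_steps = int(batches_per_epoch // gradient_accumulation_steps * num_train_epochs)

# With eval_steps=10, the evaluation set is scored this many times
num_evaluations = optimizer_steps // eval_steps

print(effective_batch_size, optimizer_steps, num_evaluations)  # 4 600 60
```

Under these assumptions the model is evaluated and checkpointed about 60 times in one epoch, so raising `eval_steps` and `save_steps` is an easy way to shorten the run if evaluation time dominates.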

In [38]:
from transformers import AutoTokenizer, AutoFeatureExtractor, AutoModelForCTC

tokenizer = AutoTokenizer.from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-russian")
feature_extractor = AutoFeatureExtractor.from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-russian")

In [39]:
feature_extractor.save_pretrained(training_args.output_dir)
tokenizer.save_pretrained(training_args.output_dir)
config.save_pretrained(training_args.output_dir)

In [40]:
!git clone https://github.com/NVIDIA/apex
%cd apex
!python3 setup.py install

Cloning into 'apex'...
remote: Enumerating objects: 11070, done.
remote: Counting objects: 100% (196/196), done.
remote: Compressing objects: 100% (117/117), done.
remote: Total 11070 (delta 108), reused 144 (delta 78), pack-reused 10874
Receiving objects: 100% (11070/11070), 15.34 MiB | 11.76 MiB/s, done.
Resolving deltas: 100% (7652/7652), done.
/content/apex
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'

 If your intention is to cross-compile, this is not an error.
By default, Apex will cross-compile for Pascal (compute capabilities 6.0, 6.1, 6.2),
Volta (compute capability 7.0), Turing (compute capability 7.5),
and, if the CUDA version is >= 11.0, Ampere (compute capability 8.0).
If you wish to cross-compile for a single specific architecture,
export TORCH_CUDA_ARCH_LIST="compute capability" before running setup.py.



torch.__version__  = 2.0.1+cu118


running install
!!

        ****************************************************************************

In [41]:
from typing import Any, Dict, Union

import torch
from packaging import version
from torch import nn

from transformers import (
    Trainer,
    is_apex_available,
)

if is_apex_available():
    from apex import amp

if version.parse(torch.__version__) >= version.parse("1.6"):
    _is_native_amp_available = True
    from torch.cuda.amp import autocast


class CTCTrainer(Trainer):
    def training_step(self, model: nn.Module, inputs: Dict[str, Union[torch.Tensor, Any]]) -> torch.Tensor:
        """
        Perform a training step on a batch of inputs.

        Subclass and override to inject custom behavior.

        Args:
            model (:obj:`nn.Module`):
                The model to train.
            inputs (:obj:`Dict[str, Union[torch.Tensor, Any]]`):
                The inputs and targets of the model.

                The dictionary will be unpacked before being fed to the model. Most models expect the targets under the
                argument :obj:`labels`. Check your model's documentation for all accepted arguments.

        Return:
            :obj:`torch.Tensor`: The tensor with training loss on this batch.
        """
        model.train()
        inputs = self._prepare_inputs(inputs)

        if self.use_cpu_amp:
            with autocast():
                loss = self.compute_loss(model, inputs)
        else:
            loss = self.compute_loss(model, inputs)

        if self.args.gradient_accumulation_steps > 1:
            loss = loss / self.args.gradient_accumulation_steps

        if self.use_cpu_amp:
            self.scaler.scale(loss).backward()
        elif self.use_apex:
            with amp.scale_loss(loss, self.optimizer) as scaled_loss:
                scaled_loss.backward()
        elif self.deepspeed:
            self.deepspeed.backward(loss)
        else:
            loss.backward()

        return loss.detach()
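
The division of the loss by `gradient_accumulation_steps` in the training step is what makes several small batches behave like one larger batch. The sketch below checks this numerically for a one-parameter model with a mean-squared-error loss. The model, the data, and the helper `mse_grad` are toy assumptions chosen so the gradient can be written by hand.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Gradient from a single batch of four examples
full_batch_grad = mse_grad(w, xs, ys)

# Same data as two micro-batches of two, each scaled by 1 / accumulation_steps
accumulation_steps = 2
micro_batch_size = 2
accumulated_grad = 0.0
for start in range(0, len(xs), micro_batch_size):
    micro_xs = xs[start:start + micro_batch_size]
    micro_ys = ys[start:start + micro_batch_size]
    accumulated_grad += mse_grad(w, micro_xs, micro_ys) / accumulation_steps

print(full_batch_grad, accumulated_grad)  # both -22.5
```

Without the scaling, the accumulated gradient would be `accumulation_steps` times too large, which acts like an unintended increase of the learning rate.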


In [42]:
trainer = CTCTrainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=processor.feature_extractor,
)

### Training

```javascript
function ConnectButton(){
    console.log("Connect pushed"); 
    document.querySelector("#top-toolbar > colab-connect-button").shadowRoot.querySelector("#connect").click() 
}
setInterval(ConnectButton,60000);
```

In [None]:
trainer.train()

In [None]:
trainer.evaluate()

<__main__.CTCTrainer at 0x7f92802ef760>

The training loss goes down and we can see that the accuracy on the test set also improves nicely. Because this notebook is just for demonstration purposes, we can stop here.

The resulting model of this notebook has been saved to [m3hrdadfi/wav2vec2-xlsr-greek-speech-emotion-recognition](https://huggingface.co/m3hrdadfi/wav2vec2-xlsr-greek-speech-emotion-recognition)

As a final check, let's load the model and verify that it indeed has learned to recognize the emotion in the speech.

Let's first load the pretrained checkpoint.

## Evaluation

In [None]:
import librosa
from sklearn.metrics import classification_report

In [None]:
eval1= pd.read_csv('/content/eval.csv')
eval1['filename'] = eval1['filename'].apply(lambda x: x.replace('./', '/content/'))

In [None]:
eval1

Unnamed: 0.1,Unnamed: 0,informant,start,end,filename,corpus,text,gender,age,age_group
0,800,Sib_14-f,34.17,35.95,/content/trimmed/Sib_14-f_34.17_35.95.wav,standart,"я в субботу приезжаю,",f,69,elderly
1,801,Sib_10-m,70.85,75.21,/content/trimmed/Sib_10-m_70.85_75.21.wav,standart,∙∙∙ ээ ночью-у ∙∙ ǝподготовили ∙∙ мм ээ ’’ всё...,m,44,middle
2,802,Sib_10-m,82.28,83.70,/content/trimmed/Sib_10-m_82.28_83.7.wav,standart,∙∙ и-и полезли —,m,44,middle
3,803,Sib_02-f,102.19,103.97,/content/trimmed/Sib_02-f_102.19_103.97.wav,standart,∙∙∙∙ Вот.,f,20,young
4,804,Sib_13-f,127.80,129.50,/content/trimmed/Sib_13-f_127.8_129.5.wav,standart,что-о ∙∙∙ так...,f,68,elderly
...,...,...,...,...,...,...,...,...,...,...
95,895,Sib_16-m,67.03,71.23,/content/trimmed/Sib_16-m_67.03_71.23.wav,standart,"∙∙∙∙ Та-ак,",m,69,elderly
96,896,Sib_04-m,71.91,72.90,/content/trimmed/Sib_04-m_71.91_72.9.wav,standart,∙∙∙ работать.,m,20,young
97,897,Sib_03-m,86.51,87.81,/content/trimmed/Sib_03-m_86.51_87.81.wav,standart,"∙∙∙ но ещё лучше,",m,19,young
98,898,Sib_08-f,38.60,41.50,/content/trimmed/Sib_08-f_38.6_41.5.wav,standart,ну с Люком много у нас было ∙∙ интересных исто...,f,48,elderly


In [None]:
eval1.to_csv('/content/eval1.csv')

In [None]:

test_dataset = load_dataset("csv", data_files={'eval': '/content/eval1.csv'})['eval']
test_dataset

Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-af1ccf4e37d9c3a3/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating eval split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-af1ccf4e37d9c3a3/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Dataset({
    features: ['Unnamed: 0.1', 'Unnamed: 0', 'informant', 'start', 'end', 'filename', 'corpus', 'text', 'gender', 'age', 'age_group'],
    num_rows: 100
})

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")

Device: cuda


In [None]:
model_name_or_path = "/content/wav2vec2-xlsr-russian-gender-recognition/checkpoint-50"
config = AutoConfig.from_pretrained(model_name_or_path)
processor = Wav2Vec2Processor.from_pretrained('/content/wav2vec2-xlsr-russian-gender-recognition')
model = Wav2Vec2ForSpeechClassification.from_pretrained(model_name_or_path).to(device)

In [None]:
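def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["filename"])
    speech_array = speech_array.squeeze().numpy()
    # Resample to the rate the feature extractor expects, matching the training preprocessing
    target_rate = processor.feature_extractor.sampling_rate
    if sampling_rate != target_rate:
        speech_array = librosa.resample(speech_array, orig_sr=sampling_rate, target_sr=target_rate)

    batch["speech"] = speech_array
    return batch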

def predict(batch):
    features = processor(batch["speech"], sampling_rate=processor.feature_extractor.sampling_rate, return_tensors="pt", padding=True)

    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)

    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits 

    pred_ids = torch.argmax(logits, dim=-1).detach().cpu().numpy()
    batch["predicted"] = pred_ids
    return batch

In [None]:
import os
for i in os.listdir('/content/trimmed'):
  if i == 'Sib_12-f_104.1_105.15.wav':
    print('found')

found


In [None]:
!rm -r /content/trimmed

In [None]:
!unzip /content/trimmed_2.zip

/content/apex


In [None]:
test_dataset = test_dataset.map(speech_file_to_array_fn)

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

In [None]:
result = test_dataset.map(predict, batched=True, batch_size=8)

Map:   0%|          | 0/100 [00:00<?, ? examples/s]

In [None]:
label_names = [config.id2label[i] for i in range(config.num_labels)]
label_names

['elderly', 'middle', 'young']

In [None]:
y_true = [config.label2id[name] for name in result["age_group"]]
y_pred = result["predicted"]

print(y_true[:5])
print(y_pred[:5])

[0, 1, 1, 2, 0]
[0, 1, 1, 2, 2]


In [None]:
print(classification_report(y_true, y_pred, target_names=label_names))

              precision    recall  f1-score   support

     elderly       0.97      0.70      0.81        46
      middle       0.88      0.78      0.82        27
       young       0.60      0.96      0.74        27

    accuracy                           0.79       100
   macro avg       0.82      0.81      0.79       100
weighted avg       0.85      0.79      0.80       100
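
To see where the columns of the report come from, the sketch below computes per-class precision, recall, and F1 by hand for the five example predictions printed earlier (`y_true[:5]` and `y_pred[:5]`). The function `per_class_report` is a hypothetical helper written for this illustration; it is not scikit-learn's implementation and covers only these three metrics plus support.

```python
def per_class_report(y_true, y_pred, num_classes):
    """Compute precision, recall, F1 and support for each class index."""
    report = {}
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[c] = {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}
    return report


# The first five labels and predictions shown in the earlier output
y_true = [0, 1, 1, 2, 0]
y_pred = [0, 1, 1, 2, 2]

for class_id, scores in per_class_report(y_true, y_pred, num_classes=3).items():
    print(class_id, scores)
```

On this tiny slice, class 0 has perfect precision but only half its examples recalled, while class 2 shows the opposite pattern. The full report above shows the same trade-off between the `elderly` and `young` rows.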

