# Example of working with models from open repositories

## - Hugging Face:

1. **Running the models**:
    - install the required libraries: requirements.txt
    - open the Hugging Face models page: https://huggingface.co/models
    - choose the tags to work with. Our tags, by task:
        - Speech to text (STT): Audio, then automatic-speech-recognition; in "Filter by name" type anything that hints at Russian (for example, just `ru`)
        - Text Sentiment Analysis (TSA): Natural Language Processing, then text-classification; in "Filter by name" type anything that hints at Russian (for example, just `ru`)
        - Speech Emotion Recognition (SER): Audio, then audio-classification; in "Filter by name" type anything that hints at Russian (for example, just `ru`). Note: many models here are not about speech, so read the model descriptions and adjust the selection.
    - Build the list of model names:
        - open a model from the list of remaining working candidates on the page
        - run the model on an example from the dataset (a small test set)
        - record the outputs and check that what is recorded makes sense
    - Discard models that solve a task other than ours (the model output is not always clearly described) and models that did not run (this happens too)
 
 2. **Test**:
    - Do not trust the results in the model card: verify them. Do not trust your own results either (verify them as well)
    - Run the "surviving list of working models" and write the results into a single DataFrame that is filled in parallel
    - build the table of model results (append the model outputs to the dataset table)
 
 3. **Analysis**:
    - clean and interpret the results (models are not obliged to write answers the way we need; they write them the way they were set up to)
    - test on different data (datasets also have their own labeling)
    - use a single metric!
 
 4. **Conclusions**
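
The workflow above can be sketched as a small harness. This is only a minimal sketch, not the notebook's actual code: `run_models`, `broken_model` and the toy models are hypothetical names, and in practice each callable would wrap a `transformers.pipeline(...)` object.

```python
def run_models(models, samples):
    """Run every model on every sample and collect the raw outputs.

    models  -- dict: model name -> callable taking one sample (hypothetical
               stand-in for a Hugging Face pipeline object)
    samples -- list of inputs (file paths or arrays)
    Returns (rows, failed): rows are dicts ready for pandas.DataFrame(rows),
    failed lists the models that raised on at least one sample.
    """
    rows, failed = [], []
    for name, model in models.items():
        try:
            outputs = [model(sample) for sample in samples]
        except Exception as err:  # a model that does not run is dropped
            failed.append((name, repr(err)))
            continue
        for sample, output in zip(samples, outputs):
            rows.append({"model": name, "sample": sample, "output": output})
    return rows, failed


def broken_model(sample):
    raise RuntimeError("weights missing")


# Toy stand-ins so the sketch runs without downloading anything
toy_models = {"echo": lambda s: {"text": s.upper()}, "broken": broken_model}
rows, failed = run_models(toy_models, ["a.wav", "b.wav"])
print(len(rows), [name for name, _ in failed])
```

Keeping the raw output next to the model name and the sample makes it possible to postpone cleaning and interpretation to the analysis step.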
        

An example of a more advanced use of the models, with fine-tuning: https://github.com/huggingface/notebooks/blob/main/examples/audio_classification.ipynb

## Audio classification

In [1]:

import pandas as pd
import numpy as np
import time

import warnings
warnings.simplefilter('ignore')


## Install the required packages

In [2]:
# # install torch (similarly tensorflow, if that is what we use)
# !pip install torch torchvision
# # generic variant
# !pip install transformers 
# # CPU-only variants
# # for torch
# !pip install 'transformers[torch]'
# # for tensorflow
# !pip install 'transformers[tf-cpu]'

# # other variants: https://huggingface.co/docs/transformers/installation

In [3]:
# the Hugging Face package

In [4]:
from transformers import pipeline


In [5]:
# object serialization

In [6]:
import pickle

## Datasets

  - **RESD**: 7 classes
  - https://huggingface.co/datasets/Aniemore/resd_annotated


In [7]:
path_resd_train = '../dataset/data_RESD.pickle'


with open(path_resd_train, 'rb') as f:
    # The protocol version used is detected automatically, so we do not
    # have to specify it.
    data = pickle.load(f)

In [8]:
data[0]

{'name': '32_happiness_enthusiasm_h_120',
 'path': 'happiness_enthusiasm_32/32_happiness_enthusiasm_h_120.wav',
 'emotion': 'happiness',
 'speech': {'path': '32_happiness_enthusiasm_h_120.wav',
  'array': array([-0.00018311, -0.00061035, -0.00076294, ...,  0.00085449,
          0.00048828,  0.00030518]),
  'sampling_rate': 16000}}

In [9]:
df_resd_train = pd.DataFrame(data)
df_resd_train.head()

Unnamed: 0,name,path,emotion,speech
0,32_happiness_enthusiasm_h_120,happiness_enthusiasm_32/32_happiness_enthusias...,happiness,"{'path': '32_happiness_enthusiasm_h_120.wav', ..."
1,36_disgust_happiness_d_130,disgust_happiness_36/36_disgust_happiness_d_13...,disgust,"{'path': '36_disgust_happiness_d_130.wav', 'ar..."
2,34_anger_fear_a_060,anger_fear_34/34_anger_fear_a_060.wav,anger,"{'path': '34_anger_fear_a_060.wav', 'array': [..."
3,25_anger_disgust_a_010,anger_disgust_25/25_anger_disgust_a_010.wav,anger,"{'path': '25_anger_disgust_a_010.wav', 'array'..."
4,17_neutral_disgust_d_092,neutral_disgust_17/17_neutral_disgust_d_092.wav,disgust,"{'path': '17_neutral_disgust_d_092.wav', 'arra..."


In [10]:
# from https://www.kaggle.com/datasets/ar4ikov/resd-dataset?resource=download

In [11]:
path_resd_test  = '../dataset/RESD_csv/test.csv'

In [12]:
df_resd = pd.read_csv(path_resd_test )
df_resd.head()

Unnamed: 0,name,path,emotion,text
0,27_neutral_fear_n_100,neutral_fear_27/27_neutral_fear_n_100.wav,neutral,"Вам дадут целый минимальный оклад, но при этом..."
1,08_sadness_anger a_010,08_sadness_anger/08_sadness_anger a_010.wav,anger,Сколько можно звонить?
2,26_enthusiasm_happiness_e_120,enthusiasm_happiness_26/26_enthusiasm_happines...,enthusiasm,А как долго тебе нужно это всё узнавать?
3,42_anger_fear_a_190,anger_fear_42/42_anger_fear_a_190.wav,anger,Ну а мне в 5 часов вставать на работу!
4,04_fear_enthusiasm f_090,04_fear_enthusiasm/04_fear_enthusiasm f_090.wav,fear,"Честно, я не подскажу, ну как и обычно, любым ..."


In [13]:
df_resd.emotion.value_counts()

emotion
fear          45
anger         44
happiness     44
enthusiasm    40
neutral       38
disgust       37
sadness       32
Name: count, dtype: int64

- roughly balanced

   - **DUSHA**: 5 classes
   - https://github.com/salute-developers/golos/tree/master/dusha#dusha-dataset

In [14]:
df_dusha = pd.read_csv('../dataset/dusha/podcast_train/raw_podcast_train.tsv', sep='\t')
df_dusha.head()

Unnamed: 0,hash_id,audio_path,duration,annotator_emo,golden_emo,annotator_id,speaker_text,speaker_emo,source_id
0,857b7099a4f5766105d166e2283066fa,wavs/857b7099a4f5766105d166e2283066fa.wav,4.4,neutral,,a6aea16a81aa926eee405c0878162c91,,,d6738a1e0d59f783987e0503ddb4ca54
1,2107b749055d85d7c09ac49fd30e3feb,wavs/2107b749055d85d7c09ac49fd30e3feb.wav,3.8,neutral,2.0,a6aea16a81aa926eee405c0878162c91,,,1fcfcacf584841d22fbdf4a51fe6177d
2,700b3a5644a0824831848c346d11c7d6,wavs/700b3a5644a0824831848c346d11c7d6.wav,2.5,neutral,,a6aea16a81aa926eee405c0878162c91,,,dfd63e80a7aca8d4cb4e14e062441886
3,e8c053899135f139e9527c1388790e36,wavs/e8c053899135f139e9527c1388790e36.wav,1.7,neutral,,a6aea16a81aa926eee405c0878162c91,,,63c3ae005d1663de92314be4377a8805
4,7fe59996e0f93b8a63e28aacf480004b,wavs/7fe59996e0f93b8a63e28aacf480004b.wav,1.9,neutral,,a6aea16a81aa926eee405c0878162c91,,,14abbe44c78118171f01348f862698cd


In [15]:
df_dusha.shape

(645813, 9)

In [16]:
df_dusha.annotator_emo.value_counts()

annotator_emo
neutral     579685
positive     37366
sad          15100
angry        11685
other         1977
Name: count, dtype: int64

- heavily imbalanced
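
The balance of the two datasets can be put into numbers. The sketch below is an illustration only: `imbalance_ratio` is a hypothetical helper, and the counts are copied from the `value_counts()` outputs above.

```python
def imbalance_ratio(counts):
    """Ratio of the largest class to the smallest class (1.0 = perfectly balanced)."""
    return max(counts.values()) / min(counts.values())


# Class counts copied from the value_counts() outputs above
resd_test = {"fear": 45, "anger": 44, "happiness": 44, "enthusiasm": 40,
             "neutral": 38, "disgust": 37, "sadness": 32}
dusha_podcast_train = {"neutral": 579685, "positive": 37366, "sad": 15100,
                       "angry": 11685, "other": 1977}

resd_ratio = imbalance_ratio(resd_test)
dusha_ratio = imbalance_ratio(dusha_podcast_train)
# Share of the majority class = accuracy of always predicting "neutral"
dusha_majority_share = max(dusha_podcast_train.values()) / sum(dusha_podcast_train.values())

print(f"RESD test ratio: {resd_ratio:.2f}")
print(f"DUSHA ratio: {dusha_ratio:.1f}, majority share: {dusha_majority_share:.3f}")
```

With a majority share this high, plain accuracy on DUSHA says little by itself, which is one reason to look at per-class metrics as well.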

In [17]:
path_wav = '../dataset/dusha/podcast_train/'

# Running a model from Hugging Face

## 1. **Running the models**

+chrisjay/afrospeech-wav2vec-run : https://huggingface.co/chrisjay/afrospeech-wav2vec-run?library=true

+Aniemore/wavlm-emotion-russian-resd : https://huggingface.co/Aniemore/wavlm-emotion-russian-resd?library=true

+"Aniemore/hubert-emotion-russian-resd": https://huggingface.co/Aniemore/hubert-emotion-russian-resd?library=true

+KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru : https://huggingface.co/KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru?library=true

-"xbgoose/hubert-base-speech-emotion-recognition-russian-dusha-finetuned" : https://huggingface.co/xbgoose/hubert-base-speech-emotion-recognition-russian-dusha-finetuned?library=true

-"xbgoose/wavlm-large-speech-emotion-recognition-russian-dusha-finetuned" : https://huggingface.co/xbgoose/wavlm-large-speech-emotion-recognition-russian-dusha-finetuned?library=true

+"ruisp/hubert-base-ls960-finetuned-gtzan" : https://huggingface.co/ruisp/hubert-base-ls960-finetuned-gtzan

+"Aniemore/wav2vec2-emotion-russian-resd" : https://huggingface.co/Aniemore/wav2vec2-emotion-russian-resd?library=true

+"Aniemore/unispeech-sat-emotion-russian-resd" : https://huggingface.co/Aniemore/unispeech-sat-emotion-russian-resd?library=true

+"justin1983/wav2vec2-large-xlsr-53-russian-finetuned-amd" : https://huggingface.co/justin1983/wav2vec2-large-xlsr-53-russian-finetuned-amd

-"xbgoose/hubert-large-speech-emotion-recognition-russian-dusha-finetuned" : https://huggingface.co/xbgoose/hubert-large-speech-emotion-recognition-russian-dusha-finetuned

-"xbgoose/wavlm-base-speech-emotion-recognition-russian-dusha-finetuned" : https://huggingface.co/xbgoose/wavlm-base-speech-emotion-recognition-russian-dusha-finetuned

-"ArinaOwl/ast-ser-ru" : https://huggingface.co/ArinaOwl/ast-ser-ru

+/- : the model ran / did not run

### List of working variants

In [18]:
# Candidate ASR checkpoints tried one at a time; only the last assignment takes effect
model_name = "anton-l/wav2vec2-large-xlsr-53-russian"  # "openai/whisper-tiny"
model_name = "jonatasgrosman/wav2vec2-xls-r-1b-russian"
model_name = "bond005/wav2vec2-large-ru-golos"
model_name = "lorenzoncina/whisper-medium-ru"  ## ****
model_name = "Shirali/whisper-small-ru"

pipe = pipeline("automatic-speech-recognition", model=model_name, trust_remote_code=True)


pipe('01_happiness_anger a_020.wav')

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


{'text': 'Слушай, я потратил при этом кучу денег для того чтобы притащиться в эту дру, это что вообще такое? Посмотри на официантов, они все в чёрных каких-то рубашках с кислыми минами, даже никто из них до сих пор не подошёл к нам.'}

## 2. Testing the models:

    - the models in the list have already been checked to load and run (checked in this same code on a small number of examples), so now we simply run each model to get its answer (in the form the model normally uses)

    - Unfortunately, the answers are not guaranteed to be structured identically (although on Hugging Face they share a common structure). For example, Aniemore sorts the answers by score, while KELONMYOSA does not. Just look carefully at what was returned
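
One way to be independent of the ordering is to select the best answer by score instead of taking the first element. This is a minimal sketch: `top_prediction` is a hypothetical helper, the first list is rounded from the RESD postprocessing output shown later in this notebook, and the second list is made up to illustrate an unsorted answer.

```python
def top_prediction(predictions):
    """Return (label, score) with the highest score, whatever the list order."""
    best = max(predictions, key=lambda item: item["score"])
    return best["label"], best["score"]


# Output already in descending score order
sorted_output = [{"score": 0.9997, "label": "happiness"},
                 {"score": 0.00008, "label": "anger"}]
# Hypothetical unsorted output (values invented for this sketch)
unsorted_output = [{"score": 0.03, "label": "sad"},
                   {"score": 0.89, "label": "neutral"},
                   {"score": 0.08, "label": "angry"}]

print(top_prediction(sorted_output))
print(top_prediction(unsorted_output))
```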

In [19]:
from transformers import pipeline
import time
import librosa
import numpy as np
import os

class SpeechtoText():
    """Thin wrapper around a Hugging Face automatic-speech-recognition pipeline."""

    def __init__(self, model_name):
        self.pipe = pipeline("automatic-speech-recognition", model=model_name, trust_remote_code=True)

    def postprocess(self, output):
        # The pipeline returns a dict; keep only the recognized text
        return output['text']

    def run(self, audio):
        # audio: path to an audio file, or a 1-D numpy array of samples
        # at the sampling rate the model expects
        rez = self.pipe(audio)
        return self.postprocess(rez)

In [20]:
import glob

path_mix_dn = '../шумоподавление/DTLN_output/*.wav'
path_mix_n = '../шумоподавление/Noisereduce_output/*.wav'
list_mix_dn = glob.glob(path_mix_dn)
len(list_mix_dn)


783

In [21]:
list_mix_n = glob.glob(path_mix_n)
len(list_mix_n)

783

In [22]:
df_text = pd.read_csv('../dataset/RESD_csv/train.csv')
df_text.head()


Unnamed: 0,name,path,emotion,text
0,32_happiness_enthusiasm_h_120,happiness_enthusiasm_32/32_happiness_enthusias...,happiness,"Конечно, расскажу, обязательно. Ой, сейчас рас..."
1,36_disgust_happiness_d_130,disgust_happiness_36/36_disgust_happiness_d_13...,disgust,Вы ещё и профессию решили поменять.
2,34_anger_fear_a_060,anger_fear_34/34_anger_fear_a_060.wav,anger,"Ты знаешь, чем это для тебя закончится?"
3,25_anger_disgust_a_010,anger_disgust_25/25_anger_disgust_a_010.wav,anger,Добрый день. Вы хотели бы приобрести недвижимо...
4,17_neutral_disgust_d_092,neutral_disgust_17/17_neutral_disgust_d_092.wav,disgust,"все ваши рекламные акции, пожалуйста, больше н..."


In [23]:
mix_project_path = '../TZ_noisa/'
mix_file = 'mix.csv'
fd = pd.read_csv(mix_project_path + mix_file)
fd.head()

Unnamed: 0,audio_path,noise_path,mix_path,bitrate,duration,mix_method,volume_sound,volume_noise
0,train/enthusiasm_neutral_43/43_enthusiasm_neut...,noise/-391904074655300970.wav,mix_test/43_enthusiasm_neutral_e_080.wav,16000,00:04.6,1,8.428574,1.811726
1,test/anger_fear_42/42_anger_fear_f_070.wav,noise/1248824479100550917.wav,mix_test/42_anger_fear_f_070.wav,16000,00:04.3,1,0.214053,6.887027
2,train/disgust_happiness_36/36_disgust_happines...,noise/-392355194681447233.wav,mix_test/36_disgust_happiness_d_100.wav,16000,00:02.2,1,9.884556,9.882694
3,train/enthusiasm_sadness_15/15_enthusiasm_sadn...,noise/1147034587767574128.wav,mix_test/15_enthusiasm_sadness_s_031.wav,16000,00:07.3,1,3.9656,7.601287
4,train/fear_disgust_48/48_fear_disgust_d_100.wav,noise/1279042682565365761.wav,mix_test/48_fear_disgust_d_100.wav,16000,00:06.7,1,2.940736,1.46258


In [24]:
path_buf = '../dataset/RESD_csv/'
list_rez = []
list_problem = []
sampling_rate = 16000
model_name = ["anton-l/wav2vec2-large-xlsr-53-russian",  # "jonatasgrosman/wav2vec2-xls-r-1b-russian",
              "bond005/wav2vec2-large-ru-golos",
              # "lorenzoncina/whisper-medium-ru",
              "Shirali/whisper-small-ru"]
for model in model_name:
    model_text = SpeechtoText(model)
    for name in list_mix_dn[:100]:
        try:
            # Strip the leading prefix to get the track file name shared by all variants
            name_file = name.split('/')[-1]
            track_file = '_'.join(name_file.split('_')[1:])
            name_n = '../шумоподавление/Noisereduce_output/Noisereduce_' + track_file
            name_mix = 'mix_test/' + track_file

            # Row of mix.csv for this mix; its first value is the clean audio path
            name_ideal = fd.loc[fd.mix_path == name_mix, :].values
            trek_name = name_ideal[0][0].split('/')[-1].split('.')[0]
            # Reference emotion and text for the clean recording
            name_em_ideal = df_text.loc[df_text.iloc[:, 0] == trek_name, ['emotion', 'text']].values[0]
            em_ideal, text_ideal = name_em_ideal

            # 1) clean recording
            path_a1 = path_buf + name_ideal[0][0]
            speech, sr = librosa.load(path_a1, sr=sampling_rate)
            rez_text = model_text.run(speech)

            # 2) Noisereduce output
            path_a2 = name_n
            speech, sr = librosa.load(path_a2, sr=sampling_rate)
            rez_text_n = model_text.run(speech)

            # 3) DTLN output
            path_a3 = name
            speech, sr = librosa.load(path_a3, sr=sampling_rate)
            rez_text_dn = model_text.run(speech)

            list_rez.append([model, path_a1, path_a3, path_a2, rez_text, rez_text_dn, rez_text_n, text_ideal, em_ideal])
            print(end='.')
        except Exception:
            # Remember which (model, file) pairs failed instead of stopping the run
            list_problem.append([model, name])
            

Some weights of the model checkpoint at anton-l/wav2vec2-large-xlsr-53-russian were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_v', 'wav2vec2.encoder.pos_conv_embed.conv.weight_g']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at anton-l/wav2vec2-large-xlsr-53-russian and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should prob

Ignored unknown kwarg option normalize
Ignored unknown kwarg option normalize
Ignored unknown kwarg option normalize
Ignored unknown kwarg option normalize
............................................................

Some weights of the model checkpoint at bond005/wav2vec2-large-ru-golos were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_v', 'wav2vec2.encoder.pos_conv_embed.conv.weight_g']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at bond005/wav2vec2-large-ru-golos and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN thi

Ignored unknown kwarg option normalize
Ignored unknown kwarg option normalize
Ignored unknown kwarg option normalize
Ignored unknown kwarg option normalize
............................................................

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


............................................................
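
The file-name bookkeeping inside the loop above can be isolated into a helper that fails loudly on unexpected names. This is a sketch under assumptions: `related_paths` is a hypothetical function, the `DTLN_<track>.wav` / `Noisereduce_<track>.wav` naming is inferred from the paths used in the loop, and the example file name is invented.

```python
import os


def related_paths(dtln_path, noisereduce_dir, mix_dir="mix_test"):
    """Derive the Noisereduce file and the mix.csv key from a DTLN output path.

    Assumes the naming 'DTLN_<track>.wav' and 'Noisereduce_<track>.wav',
    with '<mix_dir>/<track>.wav' used as the key in mix.csv.
    """
    file_name = os.path.basename(dtln_path)
    prefix, _, track_file = file_name.partition("_")
    if prefix != "DTLN" or not track_file:
        raise ValueError(f"unexpected file name: {file_name}")
    noisereduce_path = os.path.join(noisereduce_dir, "Noisereduce_" + track_file)
    mix_key = mix_dir + "/" + track_file
    return noisereduce_path, mix_key


# Hypothetical file name, used only to show the mapping
n_path, mix_key = related_paths("DTLN_output/DTLN_05_neutral_fear_n_100.wav",
                                "Noisereduce_output")
print(n_path, mix_key)
```

An explicit `ValueError` makes it easier to tell naming problems apart from model failures when inspecting `list_problem`.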

In [26]:
df_rez = pd.DataFrame(list_rez, columns=['model','name', 'name2', 'name2', 'ideal', 'dn', 'n', 'text', 'em',])
df_rez.head()

Unnamed: 0,model,name,name2,name2.1,ideal,dn,n,text,em
0,anton-l/wav2vec2-large-xlsr-53-russian,../dataset/RESD_csv/train/sadness_anger_39/39_...,../шумоподавление/DTLN_output/DTLN_39_sadness_...,../шумоподавление/Noisereduce_output/Noiseredu...,ну он уже не дышат все что ом не дихать,имому он же не упелетшат ысокачто в нейти ати,дему он уже прешать все ключтон и тиати,Он уже не дышит. Что мне делать?,sadness
1,anton-l/wav2vec2-large-xlsr-53-russian,../dataset/RESD_csv/train/sadness_happiness_49...,../шумоподавление/DTLN_output/DTLN_49_sadness_...,../шумоподавление/Noisereduce_output/Noiseredu...,я сейчас помою и отдам тебе если тебе что то н...,я сейчас помою и отдам тебе если тебе что то н...,я сейчас помою и отдамцяесли тебе что то не ус...,"Я сейчас помою и отдам тебе, если тебя что-то ...",happiness
2,anton-l/wav2vec2-large-xlsr-53-russian,../dataset/RESD_csv/train/02_anger_sadness/02_...,../шумоподавление/DTLN_output/DTLN_02_anger_sa...,../шумоподавление/Noisereduce_output/Noiseredu...,не ужили дольшь анна ивановна семьдесят лет он...,не ужили дольше анна ивановнаявлет всемьдесятв...,не ужил дольшеанна ивановска семьдесят леттаб ...,Неужели дольше? Анна Ивановна 70 лет. Она дела...,anger
3,anton-l/wav2vec2-large-xlsr-53-russian,../dataset/RESD_csv/train/sadness_disgust_33/3...,../шумоподавление/DTLN_output/DTLN_33_sadness_...,../шумоподавление/Noisereduce_output/Noiseredu...,манали что то захотил,малады ресчитыт за потел,нада решь ит это потил,Мало ли что ты захотел,disgust
4,anton-l/wav2vec2-large-xlsr-53-russian,../dataset/RESD_csv/train/05_neutral_fear/05_n...,../шумоподавление/DTLN_output/DTLN_05_neutral_...,../шумоподавление/Noisereduce_output/Noiseredu...,да они всегда вечером плохо ходят,и всегда вечером плохо ходит,да они всегда вечером плохо ходет,"Да, они всегда вечером плохо ходят.",neutral


### Collecting the answers into a DataFrame

In [27]:
df_rez.to_csv('speech_noise_text.csv')

## 3. Cleaning and analysis


 - Load the DataFrame with the results

In [28]:
def calculate_wer(reference, hypothesis):
    # NOTE: this is a simplified positional score, not an edit-distance WER:
    # words are compared position by position, without alignment or text normalization
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    # Count positions where the paired words differ (zip stops at the shorter sequence)
    substitutions = sum(1 for ref, hyp in zip(ref_words, hyp_words) if ref != hyp)
    # These two terms always cancel out (deletions == -insertions),
    # so only the positional mismatches contribute to the result
    deletions = len(ref_words) - len(hyp_words)
    insertions = len(hyp_words) - len(ref_words)
    # Total number of words in the reference text
    total_words = len(ref_words)
    wer = (substitutions + deletions + insertions) / total_words
    return wer
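
For comparison, here is a sketch of the conventional word error rate, computed from a word-level Levenshtein alignment after lowercasing and stripping punctuation. It is not the function used for the numbers in this notebook; `normalize` and `edit_distance_wer` are hypothetical names, and the normalization rule is an assumption (numerals such as "70" versus "семьдесят" would still count as errors).

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces and split into words."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def edit_distance_wer(reference, hypothesis):
    """WER = (S + D + I) / N from a word-level Levenshtein alignment."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference has no words")
    # prev[j] = edit distance between the processed reference prefix
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


reference = "Да, они всегда вечером плохо ходят."
print(edit_distance_wer(reference, "да они всегда вечером плохо ходят"))
print(edit_distance_wer(reference, "и всегда вечером плохо ходит"))
```

The two hypotheses are taken from one row of the results table (the clean and the DTLN transcripts of the same phrase). The positional function above penalizes the first one only because of the capital letter and punctuation in the reference, and the second one because a single dropped word shifts every following position.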

In [31]:
df_rez = pd.read_csv('speech_noise_text.csv', index_col=0)
df_rez.head()

Unnamed: 0,model,name,name2,name2.1,ideal,dn,n,text,em
0,anton-l/wav2vec2-large-xlsr-53-russian,../dataset/RESD_csv/train/sadness_anger_39/39_...,../шумоподавление/DTLN_output/DTLN_39_sadness_...,../шумоподавление/Noisereduce_output/Noiseredu...,ну он уже не дышат все что ом не дихать,имому он же не упелетшат ысокачто в нейти ати,дему он уже прешать все ключтон и тиати,Он уже не дышит. Что мне делать?,sadness
1,anton-l/wav2vec2-large-xlsr-53-russian,../dataset/RESD_csv/train/sadness_happiness_49...,../шумоподавление/DTLN_output/DTLN_49_sadness_...,../шумоподавление/Noisereduce_output/Noiseredu...,я сейчас помою и отдам тебе если тебе что то н...,я сейчас помою и отдам тебе если тебе что то н...,я сейчас помою и отдамцяесли тебе что то не ус...,"Я сейчас помою и отдам тебе, если тебя что-то ...",happiness
2,anton-l/wav2vec2-large-xlsr-53-russian,../dataset/RESD_csv/train/02_anger_sadness/02_...,../шумоподавление/DTLN_output/DTLN_02_anger_sa...,../шумоподавление/Noisereduce_output/Noiseredu...,не ужили дольшь анна ивановна семьдесят лет он...,не ужили дольше анна ивановнаявлет всемьдесятв...,не ужил дольшеанна ивановска семьдесят леттаб ...,Неужели дольше? Анна Ивановна 70 лет. Она дела...,anger
3,anton-l/wav2vec2-large-xlsr-53-russian,../dataset/RESD_csv/train/sadness_disgust_33/3...,../шумоподавление/DTLN_output/DTLN_33_sadness_...,../шумоподавление/Noisereduce_output/Noiseredu...,манали что то захотил,малады ресчитыт за потел,нада решь ит это потил,Мало ли что ты захотел,disgust
4,anton-l/wav2vec2-large-xlsr-53-russian,../dataset/RESD_csv/train/05_neutral_fear/05_n...,../шумоподавление/DTLN_output/DTLN_05_neutral_...,../шумоподавление/Noisereduce_output/Noiseredu...,да они всегда вечером плохо ходят,и всегда вечером плохо ходит,да они всегда вечером плохо ходет,"Да, они всегда вечером плохо ходят.",neutral


    - List of working models (those for which there are results)

In [32]:
list_ok_model = set(df_rez['model'].unique())
list_ok_model

{'Shirali/whisper-small-ru',
 'anton-l/wav2vec2-large-xlsr-53-russian',
 'bond005/wav2vec2-large-ru-golos'}

In [33]:
len(list_ok_model)

3

    - Size of the data for analysis:

In [34]:
df_rez.shape

(180, 9)

In [35]:
df_rez.em.value_counts()

em
happiness     42
sadness       36
disgust       30
fear          27
enthusiasm    21
neutral       18
anger          6
Name: count, dtype: int64

In [36]:

for model_name in list_ok_model:
    df = df_rez.loc[df_rez.model == model_name,:]
    print(model_name, '****************')
    print(df.shape[0])

anton-l/wav2vec2-large-xlsr-53-russian ****************
60
bond005/wav2vec2-large-ru-golos ****************
60
Shirali/whisper-small-ru ****************
60


### Metric evaluation: WER


In [37]:
df_rez.columns

Index(['model', 'name', 'name2', 'name2.1', 'ideal', 'dn', 'n', 'text', 'em'], dtype='object')

In [63]:
wer_rez = []
for model_name in list_ok_model:
    df = df_rez.loc[df_rez.model == model_name,:]
    print(model_name, '****************')
    
    for i in range(df.shape[0]):
        tt = df.iloc[i,4:8].values
        reference =  tt[-1]
        list_a = []
        for k in range(3):
            # empty transcripts are read back from the CSV as NaN, so replace them with ''
            hypothesis = tt[k] if isinstance(tt[k], str) else ''
            list_a.append(calculate_wer(reference, hypothesis))
        wer_rez.append([model_name] +list_a)


anton-l/wav2vec2-large-xlsr-53-russian ****************
bond005/wav2vec2-large-ru-golos ****************
Shirali/whisper-small-ru ****************


In [64]:
df_wer =  pd.DataFrame(wer_rez, columns=['model', 'ideal', 'dn', 'n'])
df_wer

| | model | ideal | dn | n |
|---|---|---|---|---|
| 0 | anton-l/wav2vec2-large-xlsr-53-russian | 1.000000 | 1.000000 | 1.000000 |
| 1 | anton-l/wav2vec2-large-xlsr-53-russian | 0.642857 | 0.642857 | 0.714286 |
| 2 | anton-l/wav2vec2-large-xlsr-53-russian | 1.000000 | 0.937500 | 0.937500 |
| 3 | anton-l/wav2vec2-large-xlsr-53-russian | 0.800000 | 0.800000 | 1.000000 |
| 4 | anton-l/wav2vec2-large-xlsr-53-russian | 0.333333 | 0.833333 | 0.333333 |
| ... | ... | ... | ... | ... |
| 175 | Shirali/whisper-small-ru | 0.909091 | 0.727273 | 1.000000 |
| 176 | Shirali/whisper-small-ru | 1.000000 | 0.500000 | 1.000000 |
| 177 | Shirali/whisper-small-ru | 0.454545 | 0.727273 | 0.727273 |
| 178 | Shirali/whisper-small-ru | 0.909091 | 0.818182 | 1.000000 |


In [65]:
WER_REZ = []
df_wer = df_wer.dropna()
for name in list_ok_model:
    mean_model = df_wer.loc[df_wer.model == name, ['ideal', 'dn', 'n']].mean(axis=0)
    WER_REZ.append([name] + mean_model.tolist() + [df_wer.loc[df_wer.model == name, ['ideal', 'dn', 'n']].shape[0]])

In [66]:
pd.DataFrame(WER_REZ, columns=['model', 'ideal', 'dn', 'n', 'N'])

| | model | ideal | dn | n | N |
|---|---|---|---|---|---|
| 0 | anton-l/wav2vec2-large-xlsr-53-russian | 0.717904 | 0.725476 | 0.715559 | 60 |
| 1 | bond005/wav2vec2-large-ru-golos | 0.630663 | 0.571637 | 0.545204 | 60 |
| 2 | Shirali/whisper-small-ru | 0.475716 | 0.529243 | 0.564664 | 60 |


A more adequate variant:

In [36]:
from sklearn.metrics import classification_report, f1_score, accuracy_score

In [37]:
print(classification_report(df.ground.values, df.new_m0.values))

              precision    recall  f1-score   support

       angry       0.88      0.90      0.89        31
     neutral       0.88      0.93      0.91       258
       other       0.75      0.60      0.67         5
    positive       0.92      0.73      0.81        15
         sad       0.91      0.85      0.88       193

    accuracy                           0.89       502
   macro avg       0.87      0.80      0.83       502
weighted avg       0.89      0.89      0.89       502



In [38]:
f1_score(df.ground.values , df.new_m0.values, average='weighted')

0.8896643196605611
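
The `macro avg` and `weighted avg` rows of the report can be reproduced by hand from the per-class rows. The sketch below only re-does that arithmetic on the rounded values printed above, so the results agree with the report to two decimals, not exactly.

```python
# (f1-score, support) per class, copied from the classification report above
per_class = {
    "angry": (0.89, 31),
    "neutral": (0.91, 258),
    "other": (0.67, 5),
    "positive": (0.81, 15),
    "sad": (0.88, 193),
}

total_support = sum(support for _, support in per_class.values())
# Macro average: every class counts equally
macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)
# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(f1 * support for f1, support in per_class.values()) / total_support

print(total_support, round(macro_f1, 2), round(weighted_f1, 2))
```

The weighted value is dominated by the large `neutral` and `sad` classes, while the macro value is pulled down by the small `other` class. That is why the two averages differ on imbalanced data.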

In [39]:
rez_model_score = []
columns_rez = ['model','dataset', 'acc0', 'acc1', 'f1', 't']
for model_name in model_list:
    df = df_out.loc[df_out.model == model_name,:]
    
    acc0 = accuracy_score(df.ground.values , df.new_m0.values)
    acc1 = accuracy_score(df.ground.values , df.new_m1.values)
    f1 = f1_score(df.ground.values , df.new_m0.values, average='weighted')
    print(model_name, '****************')
    rez_model_score.append([model_name, 'DUSHA', acc0, acc1, f1 , np.mean(df.t)])
rez_model_score_pd = pd.DataFrame(rez_model_score, columns=columns_rez)
rez_model_score_pd.head()

Aniemore/wav2vec2-emotion-russian-resd ****************
Aniemore/hubert-emotion-russian-resd ****************
Aniemore/wavlm-emotion-russian-resd ****************
Aniemore/unispeech-sat-emotion-russian-resd ****************
KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru ****************


| | model | dataset | acc0 | acc1 | f1 | t |
|---|---|---|---|---|---|---|
| 0 | Aniemore/wav2vec2-emotion-russian-resd | DUSHA | 0.101594 | 0.149402 | 0.110919 | 2.77051 |
| 1 | Aniemore/hubert-emotion-russian-resd | DUSHA | 0.217131 | 0.436255 | 0.305991 | 2.530516 |
| 2 | Aniemore/wavlm-emotion-russian-resd | DUSHA | 0.229084 | 0.322709 | 0.319163 | 1.988178 |
| 3 | Aniemore/unispeech-sat-emotion-russian-resd | DUSHA | 0.356574 | 0.342629 | 0.430552 | 2.920435 |
| 4 | KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru | DUSHA | 0.890438 | 0.02988 | 0.889664 | 2.659251 |


## 4. Conclusion

    - KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru gave the best accuracy, but it was trained on this dataset; the result is comparable to the one reported in the model card
    - Aniemore/unispeech-sat-emotion-russian-resd has the best accuracy among the Aniemore variants
    - KELONMYOSA still needs to be checked on RESD

In [40]:
df_out.loc[df_out.model == model_list[2],:].head()

| | model | wav | ground | m0 | s0 | m1 | s1 | t | new_m0 | new_m1 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aniemore/wavlm-emotion-russian-resd | dataset/dusha/podcast_train/wavs/eac47d2886774... | sad | fear | 0.527574 | sadness | 0.464834 | 1.988178 | other | sad |
| 1 | Aniemore/wavlm-emotion-russian-resd | dataset/dusha/podcast_train/wavs/0f6231dcaeb74... | neutral | fear | 0.722375 | sadness | 0.276071 | 1.988178 | other | sad |
| 2 | Aniemore/wavlm-emotion-russian-resd | dataset/dusha/podcast_train/wavs/bee2345241659... | neutral | sadness | 0.749593 | fear | 0.248569 | 1.988178 | sad | other |
| 3 | Aniemore/wavlm-emotion-russian-resd | dataset/dusha/podcast_train/wavs/a4b65036044fc... | angry | happiness | 0.999034 | enthusiasm | 0.000642 | 1.988178 | positive | positive |
| 4 | Aniemore/wavlm-emotion-russian-resd | dataset/dusha/podcast_train/wavs/2124ca527a2db... | neutral | fear | 0.999496 | enthusiasm | 0.000248 | 1.988178 | other | positive |
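
The `new_m0` and `new_m1` columns hold the model labels translated into the DUSHA label set. The code that produced them is not shown here, so the following is only a sketch of such a mapping: four pairs are read off the rows above, and the remaining three are assumptions made for illustration.

```python
# Pairs confirmed by the rows shown above: fear->other, sadness->sad,
# happiness->positive, enthusiasm->positive.
# The remaining three pairs are ASSUMPTIONS made for this sketch.
RESD_TO_DUSHA = {
    "fear": "other",
    "sadness": "sad",
    "happiness": "positive",
    "enthusiasm": "positive",
    "anger": "angry",      # assumption
    "neutral": "neutral",  # assumption
    "disgust": "other",    # assumption
}


def to_dusha_label(resd_label):
    """Map a RESD-style label to the DUSHA label set; unknown labels become 'other'."""
    return RESD_TO_DUSHA.get(resd_label, "other")


mapped = [to_dusha_label(m) for m in ["fear", "sadness", "happiness", "enthusiasm"]]
print(mapped)
```

Whatever mapping is chosen, it directly affects the accuracy figures, so it is worth recording it next to the results.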


## 5. Contents (model graph, walkthrough of the main execution loop):

    - create the model:

In [41]:
classifier = pipeline("audio-classification", model=model_list[2], trust_remote_code=True)

Some weights of the model checkpoint at Aniemore/wavlm-emotion-russian-resd were not used when initializing WavLMForSequenceClassification: ['wavlm.encoder.pos_conv_embed.conv.weight_g', 'wavlm.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing WavLMForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing WavLMForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of WavLMForSequenceClassification were not initialized from the model checkpoint at Aniemore/wavlm-emotion-russian-resd and are newly initialized: ['wavlm.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wavlm.encoder.pos_conv_embed.conv.param

    - inspect the main part of the model:

In [42]:
classifier.model

WavLMForSequenceClassification(
  (wavlm): WavLMModel(
    (feature_extractor): WavLMFeatureEncoder(
      (conv_layers): ModuleList(
        (0): WavLMLayerNormConvLayer(
          (conv): Conv1d(1, 512, kernel_size=(10,), stride=(5,), bias=False)
          (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation): GELUActivation()
        )
        (1-4): 4 x WavLMLayerNormConvLayer(
          (conv): Conv1d(512, 512, kernel_size=(3,), stride=(2,), bias=False)
          (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation): GELUActivation()
        )
        (5-6): 2 x WavLMLayerNormConvLayer(
          (conv): Conv1d(512, 512, kernel_size=(2,), stride=(2,), bias=False)
          (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (activation): GELUActivation()
        )
      )
    )
    (feature_projection): WavLMFeatureProjection(
      (layer_norm): LayerNorm((512,), eps=1e-05, element

### Data preprocessing

In [44]:
tr_pt = classifier.preprocess('dataset/RESD_train/032_happiness_enthusiasm_h_120')
tr_pt

{'input_values': tensor([[-0.0036, -0.0134, -0.0169,  ...,  0.0203,  0.0119,  0.0077]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1]], dtype=torch.int32)}

### Model execution

In [45]:
tr_model = classifier.model(tr_pt['input_values'], tr_pt['attention_mask'])
tr_model

SequenceClassifierOutput(loss=None, logits=tensor([[-0.7961, -1.9680, -0.8598, -1.5010,  8.6319, -3.1188, -1.7019]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

### Postprocessing

In [46]:
tr_out = classifier.postprocess(tr_model)
tr_out

[{'score': 0.9997391104698181, 'label': 'happiness'},
 {'score': 8.041437104111537e-05, 'label': 'anger'},
 {'score': 7.54516659071669e-05, 'label': 'enthusiasm'},
 {'score': 3.9740101783536375e-05, 'label': 'fear'},
 {'score': 3.250683221267536e-05, 'label': 'sadness'}]
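
The scores above are consistent with a softmax over the logits from the previous step, followed by keeping the five largest values. The sketch below reproduces that arithmetic in plain Python; it illustrates the computation and is not the pipeline's own code. Classes are referred to by index because the full index-to-label table is not shown here.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the SequenceClassifierOutput shown above
logits = [-0.7961, -1.9680, -0.8598, -1.5010, 8.6319, -3.1188, -1.7019]
probs = softmax(logits)

# Indices of the five highest scores, highest first
top5 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:5]
for i in top5:
    print(i, f"{probs[i]:.6g}")
```

Index 4 receives almost all of the probability mass, matching the `happiness` score in the output above.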

In [69]:
## 7. Cross-check on RESD

In [47]:
df_resd.head()

Unnamed: 0,name,path,emotion,text
0,27_neutral_fear_n_100,neutral_fear_27/27_neutral_fear_n_100.wav,neutral,"Вам дадут целый минимальный оклад, но при этом..."
1,08_sadness_anger a_010,08_sadness_anger/08_sadness_anger a_010.wav,anger,Сколько можно звонить?
2,26_enthusiasm_happiness_e_120,enthusiasm_happiness_26/26_enthusiasm_happines...,enthusiasm,А как долго тебе нужно это всё узнавать?
3,42_anger_fear_a_190,anger_fear_42/42_anger_fear_a_190.wav,anger,Ну а мне в 5 часов вставать на работу!
4,04_fear_enthusiasm f_090,04_fear_enthusiasm/04_fear_enthusiasm f_090.wav,fear,"Честно, я не подскажу, ну как и обычно, любым ..."


In [134]:
df_resd.shape

(280, 4)

### Reshaping the data into a convenient form (saving to files)

In [37]:
# import scipy.io.wavfile as wavf
# import numpy as np


 
# for i in range(df_resd.shape[0]):
#     try:
#         wav_name = df_resd.name.iloc[i]
#         speech = df_resd.speech.iloc[i]
#         samples = speech['array']
#         fs = speech['sampling_rate']
#         out_f = 'dataset/RESD/' + str(i) + wav_name
#         wavf.write(out_f, fs, samples)
#     except:
#         print('error')
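
If `scipy` is not available, the same export can be sketched with the standard-library `wave` module. The helper name `write_wav_int16` and the output file name are hypothetical, and the sketch assumes mono float samples in the range [-1, 1]. Unlike `scipy.io.wavfile.write` called on a float array, it stores 16-bit PCM.

```python
import os
import struct
import tempfile
import wave


def write_wav_int16(path, samples, sampling_rate):
    """Write mono float samples in [-1, 1] to `path` as 16-bit PCM."""
    # Clip first so that out-of-range values cannot overflow int16.
    clipped = [max(-1.0, min(1.0, float(s))) for s in samples]
    pcm = struct.pack("<%dh" % len(clipped), *(int(s * 32767) for s in clipped))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)            # mono
        wf.setsampwidth(2)            # 2 bytes = 16 bit
        wf.setframerate(sampling_rate)
        wf.writeframes(pcm)


# Toy data: four samples at 16 kHz, written to a temporary file.
out_path = os.path.join(tempfile.gettempdir(), "example_resd.wav")
write_wav_int16(out_path, [0.0, 0.5, -0.5, 1.0], 16000)

with wave.open(out_path, "rb") as wf:
    print(wf.getframerate(), wf.getnframes())
```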
   

In [None]:
# rez_resd = [] 
# for model_name in model_list:
#     try:
#         t1 = time.time()
#         classifier = pipeline("audio-classification", model=model_name, trust_remote_code=True)
#         t1 = time.time() -  t1
#         for i in range(0,df_resd.shape[0],3):
#             try:
#                 wav_name = df_resd.name.iloc[i]
#                 # speech = df_resd.speech.iloc[i]
#                 # samples = speech['array']
#                 # fs = speech['sampling_rate']
#                 out_f =  'dataset/RESD/' + str(i) + wav_name
#                 t1 = time.time()
#                 label_pred = classifier(out_f)
#                 t1 = time.time() -  t1
                
#                 rez_model_s = [ ss['score'] for ss in label_pred]
#                 k = np.argmax(rez_model_s)
                
#                 label_true = df_resd.emotion.iloc[i]
#                 rez_resd.append([i, wav_name, label_pred[k]['label'],label_true,label_pred, t1 ]) 
#                 # break
#             except:
#                 pass
#     except:
#         print('model error')

### Running the analysis

In [104]:
     
rez_resd = []
for model_name in model_list:
    try:
        # Load the model; if loading fails, the whole model is skipped.
        classifier = pipeline("audio-classification", model=model_name, trust_remote_code=True)
        for i in range(0, df_resd.shape[0], 3):  # every third row
            try:
                wav_name = df_resd.path.iloc[i]
                out_f = 'dataset/RESD_csv/test/' + wav_name

                # Time only the inference call.
                t1 = time.time()
                label_pred = classifier(out_f)
                t1 = time.time() - t1

                # Pick the label with the highest score.
                rez_model_s = [ss['score'] for ss in label_pred]
                k = np.argmax(rez_model_s)

                label_true = df_resd.emotion.iloc[i]
                rez_resd.append([i, model_name, wav_name, label_pred[k]['label'], label_true, label_pred, t1])
            except Exception:
                # Files that fail are skipped without a message.
                pass
    except Exception:
        print('model error')

Some weights of Wav2Vec2ForSpeechClassification were not initialized from the model checkpoint at Aniemore/wav2vec2-emotion-russian-resd and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.weight', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'classifier.out_proj.bias', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Could not load the `decoder` for Aniemore/wav2vec2-emotion-russian-resd. Defaulting to raw CTC. Error: No module named 'kenlm'
Try to install `kenlm`: `pip install kenlm
Try to install `pyctcdecode`: `pip install pyctcdecode
Some weights of the model checkpoint at Aniemore/hubert-emotion-russian-resd were not used when initializing HubertForSequenceClassification: ['hubert.encoder.pos_conv_embed.conv.weight_v', 'hubert.encoder.pos_conv_embed.conv.weight_g']
- Th

In [123]:
df_K = pd.DataFrame(rez_resd, columns = ['number', 'model', 'file', 'label_pred', 'label_true', 'rez_from_model', 't1' ])
df_K.head()

Unnamed: 0,number,model,file,label_pred,label_true,rez_from_model,t1
0,0,Aniemore/wav2vec2-emotion-russian-resd,neutral_fear_27/27_neutral_fear_n_100.wav,anger,neutral,"[{'score': 0.17598015069961548, 'label': 'ange...",1.761193
1,3,Aniemore/wav2vec2-emotion-russian-resd,anger_fear_42/42_anger_fear_a_190.wav,anger,anger,"[{'score': 0.18954557180404663, 'label': 'ange...",0.346388
2,6,Aniemore/wav2vec2-emotion-russian-resd,fear_disgust_41/41_fear_disgust_f_050.wav,enthusiasm,fear,"[{'score': 0.1559067964553833, 'label': 'enthu...",2.020141
3,9,Aniemore/wav2vec2-emotion-russian-resd,neutral_fear_27/27_neutral_fear_n_020.wav,neutral,neutral,"[{'score': 0.18447254598140717, 'label': 'neut...",1.415429
4,12,Aniemore/wav2vec2-emotion-russian-resd,anger_disgust_19/19_anger_disgust_a_030.wav,anger,anger,"[{'score': 0.18711082637310028, 'label': 'ange...",0.863154


In [124]:
df_K.tail()

Unnamed: 0,number,model,file,label_pred,label_true,rez_from_model,t1
465,267,KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru,enthusiasm_neutral_37/37_enthusiasm_neutral_e_...,neutral,enthusiasm,"[{'label': 'neutral', 'score': 0.44425}, {'lab...",1.123654
466,270,KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru,fear_happiness_30/30_fear_happiness_f_150.wav,positive,fear,"[{'label': 'neutral', 'score': 0.0368}, {'labe...",1.33174
467,273,KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru,sadness_happiness_49/49_sadness_happiness_s_03...,positive,sadness,"[{'label': 'neutral', 'score': 0.4427}, {'labe...",1.966012
468,276,KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru,happiness_neutral_38/38_happiness_neutral_n_03...,angry,neutral,"[{'label': 'neutral', 'score': 0.3054}, {'labe...",0.817347
469,279,KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru,04_fear_enthusiasm/04_fear_enthusiasm f_140.wav,positive,fear,"[{'label': 'neutral', 'score': 0.33407}, {'lab...",1.549704


In [125]:
df_K.to_csv('resd_result.csv')

### Data processing

In [126]:
df_K = pd.read_csv('resd_result.csv', index_col=0)

In [127]:
df_K['new_label_true'] = df_K['label_true'].values #'neutral'

In [128]:
df_K.loc[df_K.label_true=='happiness','new_label_true'] = 'positive'
df_K.loc[df_K.label_true=='enthusiasm','new_label_true'] = 'positive'
df_K.loc[df_K.label_pred=='happiness','label_pred'] = 'positive'
df_K.loc[df_K.label_pred=='enthusiasm','label_pred'] = 'positive'

In [129]:
df_K.loc[df_K.label_true=='sadness','new_label_true'] = 'sad'
df_K.loc[df_K.label_pred=='sadness','label_pred'] = 'sad'


In [130]:
df_K.loc[df_K.label_true=='disgust','new_label_true'] = 'other'
df_K.loc[df_K.label_true=='fear','new_label_true'] = 'other'
df_K.loc[df_K.label_pred=='disgust','label_pred'] = 'other'
df_K.loc[df_K.label_pred=='fear','label_pred'] = 'other'


In [131]:
df_K.loc[df_K.label_true=='anger','new_label_true'] = 'angry'
df_K.loc[df_K.label_pred=='anger','label_pred'] = 'angry'
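
The cells above bring both label sets down to five shared classes (`positive`, `sad`, `other`, `angry`, `neutral`). The same mapping can be written once as a dictionary; the sketch below is one way to do it, and the names `RESD_TO_COMMON` and `to_common_label` are hypothetical.

```python
# Mapping taken from the cells above; labels not listed (e.g. 'neutral') stay unchanged.
RESD_TO_COMMON = {
    "happiness": "positive",
    "enthusiasm": "positive",
    "sadness": "sad",
    "disgust": "other",
    "fear": "other",
    "anger": "angry",
}


def to_common_label(label):
    """Map a RESD label to the shared label set; unknown labels pass through."""
    return RESD_TO_COMMON.get(label, label)


examples = ["happiness", "anger", "neutral", "positive"]
print([to_common_label(x) for x in examples])
```

With pandas the function can be applied to a whole column, for example `df_K['label_pred'].map(to_common_label)`. Keeping the mapping in one place means the true and predicted labels cannot drift apart.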




In [132]:
df_K.tail()

Unnamed: 0,number,model,file,label_pred,label_true,rez_from_model,t1,new_label_true
465,267,KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru,enthusiasm_neutral_37/37_enthusiasm_neutral_e_...,neutral,enthusiasm,"[{'label': 'neutral', 'score': 0.44425}, {'lab...",1.123654,positive
466,270,KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru,fear_happiness_30/30_fear_happiness_f_150.wav,positive,fear,"[{'label': 'neutral', 'score': 0.0368}, {'labe...",1.33174,other
467,273,KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru,sadness_happiness_49/49_sadness_happiness_s_03...,positive,sadness,"[{'label': 'neutral', 'score': 0.4427}, {'labe...",1.966012,sad
468,276,KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru,happiness_neutral_38/38_happiness_neutral_n_03...,angry,neutral,"[{'label': 'neutral', 'score': 0.3054}, {'labe...",0.817347,neutral
469,279,KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru,04_fear_enthusiasm/04_fear_enthusiasm f_140.wav,positive,fear,"[{'label': 'neutral', 'score': 0.33407}, {'lab...",1.549704,other


In [133]:
df_K.head()

Unnamed: 0,number,model,file,label_pred,label_true,rez_from_model,t1,new_label_true
0,0,Aniemore/wav2vec2-emotion-russian-resd,neutral_fear_27/27_neutral_fear_n_100.wav,angry,neutral,"[{'score': 0.17598015069961548, 'label': 'ange...",1.761193,neutral
1,3,Aniemore/wav2vec2-emotion-russian-resd,anger_fear_42/42_anger_fear_a_190.wav,angry,anger,"[{'score': 0.18954557180404663, 'label': 'ange...",0.346388,angry
2,6,Aniemore/wav2vec2-emotion-russian-resd,fear_disgust_41/41_fear_disgust_f_050.wav,positive,fear,"[{'score': 0.1559067964553833, 'label': 'enthu...",2.020141,other
3,9,Aniemore/wav2vec2-emotion-russian-resd,neutral_fear_27/27_neutral_fear_n_020.wav,neutral,neutral,"[{'score': 0.18447254598140717, 'label': 'neut...",1.415429,neutral
4,12,Aniemore/wav2vec2-emotion-russian-resd,anger_disgust_19/19_anger_disgust_a_030.wav,angry,anger,"[{'score': 0.18711082637310028, 'label': 'ange...",0.863154,angry


In [134]:
acc = np.mean(df_K.label_pred == df_K.new_label_true)

## **Accuracy (acc) on RESD**

In [135]:
acc

0.6234042553191489

In [142]:
columns_rez = ['model','dataset', 'acc0', 'acc1', 'f1', 't']
# Judging by the output, rez_model_score already holds the DUSHA rows; the RESD rows are appended to it.
for model_name in model_list:
    df = df_K.loc[df_K.model == model_name,:]
    
    # On RESD acc0 and acc1 are computed identically; both are kept so the columns match the DUSHA rows.
    acc0 = accuracy_score(df.new_label_true.values , df.label_pred.values)
    acc1 = accuracy_score(df.new_label_true.values , df.label_pred.values)
    f1 = f1_score(df.new_label_true, df.label_pred.values, average='weighted')
    print(model_name, '****************')
    rez_model_score.append([model_name, 'RESD', acc0, acc1, f1 , np.mean(df.t1)])
rez_model_score_pd1 = pd.DataFrame(rez_model_score, columns=columns_rez)
rez_model_score_pd1.head()

Aniemore/wav2vec2-emotion-russian-resd ****************
Aniemore/hubert-emotion-russian-resd ****************
Aniemore/wavlm-emotion-russian-resd ****************
Aniemore/unispeech-sat-emotion-russian-resd ****************
KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru ****************


Unnamed: 0,model,dataset,acc0,acc1,f1,t
0,Aniemore/wav2vec2-emotion-russian-resd,DUSHA,0.101594,0.149402,0.110919,2.77051
1,Aniemore/hubert-emotion-russian-resd,DUSHA,0.217131,0.436255,0.305991,2.530516
2,Aniemore/wavlm-emotion-russian-resd,DUSHA,0.229084,0.322709,0.319163,1.988178
3,Aniemore/unispeech-sat-emotion-russian-resd,DUSHA,0.356574,0.342629,0.430552,2.920435
4,KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru,DUSHA,0.890438,0.02988,0.889664,2.659251


In [143]:
rez_model_score_pd1.head(40)

Unnamed: 0,model,dataset,acc0,acc1,f1,t
0,Aniemore/wav2vec2-emotion-russian-resd,DUSHA,0.101594,0.149402,0.110919,2.77051
1,Aniemore/hubert-emotion-russian-resd,DUSHA,0.217131,0.436255,0.305991,2.530516
2,Aniemore/wavlm-emotion-russian-resd,DUSHA,0.229084,0.322709,0.319163,1.988178
3,Aniemore/unispeech-sat-emotion-russian-resd,DUSHA,0.356574,0.342629,0.430552,2.920435
4,KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru,DUSHA,0.890438,0.02988,0.889664,2.659251
5,Aniemore/wav2vec2-emotion-russian-resd,RESD,0.340426,0.340426,0.310288,0.947531
6,Aniemore/hubert-emotion-russian-resd,RESD,0.797872,0.797872,0.798911,0.928738
7,Aniemore/wavlm-emotion-russian-resd,RESD,0.87234,0.87234,0.87297,0.971125
8,Aniemore/unispeech-sat-emotion-russian-resd,RESD,0.712766,0.712766,0.703554,0.940354
9,KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru,RESD,0.393617,0.393617,0.323023,0.867656
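
The `f1` column is computed with `average='weighted'`: F1 is calculated per class and then averaged with weights proportional to the number of true samples of each class. A minimal standard-library sketch of that calculation follows; `weighted_f1` is a hypothetical helper written for illustration, not the scikit-learn implementation, and the labels are toy data.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)          # number of true samples per class
    total = len(y_true)
    score = 0.0
    # Classes that appear only in y_pred have zero support and add nothing.
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score


toy_true = ["angry", "angry", "sad"]
toy_pred = ["angry", "sad", "sad"]
print(weighted_f1(toy_true, toy_pred))
```

Because of the weighting, frequent classes dominate the score, so it can sit close to accuracy even when a rare class is never predicted correctly.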


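### Conclusion:

    - on a dataset that is foreign to the model, results are much worse: the `Aniemore/*-resd` models other than wav2vec2 reach acc0 of 0.71 to 0.87 on RESD but only 0.22 to 0.36 on DUSHA, while `KELONMYOSA/wav2vec2-xls-r-300m-emotion-ru` reaches 0.89 on DUSHA and 0.39 on RESD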

## Interesting speech projects (with models, code, and useful theory; most target English, so only part of them is reusable for us)

    - https://github.com/speechbrain/speechbrain (models for Speech Separation, Speech Enhancement, Voice Activity Detection, Diarization), plus many more models for analysis

## Appendix: our proposed processing pipeline

Emotion recognition from speech audio

In [10]:


import torch
from aniemore.recognizers.voice import VoiceRecognizer
from aniemore.models import HuggingFaceModel

model_w = HuggingFaceModel.Voice.WavLM
device = 'cuda' if torch.cuda.is_available() else 'cpu'
vr = VoiceRecognizer(model=model_w, device=device)

n = 0
wav_name = df_resd_train.name.iloc[n]
out_f =  'dataset/RESD_train/' + str(n) + wav_name
vr.recognize(out_f, return_single_label=True)

Some weights of the model checkpoint at aniemore/wavlm-emotion-russian-resd were not used when initializing WavLMForSequenceClassification: ['wavlm.encoder.pos_conv_embed.conv.weight_g', 'wavlm.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing WavLMForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing WavLMForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of WavLMForSequenceClassification were not initialized from the model checkpoint at aniemore/wavlm-emotion-russian-resd and are newly initialized: ['wavlm.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wavlm.encoder.pos_conv_embed.conv.param

'happiness'

Speech to text

In [11]:
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import Audio, load_dataset

# load model and processor
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model_t = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
forced_decoder_ids = processor.get_decoder_prompt_ids(language="russian", task="transcribe")

wav_name = df_resd_train.name.iloc[n]
speech = df_resd_train.speech.iloc[n]
#
#  read the audio here and build the structure {'array': [numpy array of audio samples], "sampling_rate": 16000}
#

input_features = processor(speech["array"], sampling_rate=speech["sampling_rate"], return_tensors="pt").input_features

# generate token ids
predicted_ids = model_t.generate(input_features, forced_decoder_ids=forced_decoder_ids)
# decode token ids to text, dropping special tokens
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
transcription

[' Конечно, скажу, обязательно. Ой, сейчас, ну скажу.']

Emotion recognition from text

In [12]:
import torch
from aniemore.recognizers.text import TextRecognizer
from aniemore.models import HuggingFaceModel

model_e = HuggingFaceModel.Text.Bert_Tiny2
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tr = TextRecognizer(model=model_e, device=device)

tr.recognize(transcription[0], return_single_label=True)

'happiness'

### The full pipeline

In [19]:
rez = []
for i in range(100):  # use df_resd_train.shape[0] to process the whole set
    try:
        t1 = time.time()
        wav_name = df_resd_train.name.iloc[i]
        speech = df_resd_train.speech.iloc[i]
        out_f = 'dataset/RESD_train/' + str(i) + wav_name

        # SER: emotion from the audio file
        s_em = vr.recognize(out_f, return_single_label=True)

        # STT: Whisper features, generation, decoding without special tokens
        input_features = processor(speech["array"], sampling_rate=speech["sampling_rate"], return_tensors="pt").input_features
        predicted_ids = model_t.generate(input_features, forced_decoder_ids=forced_decoder_ids)
        transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

        # TSA: emotion from the transcribed text
        t_em = tr.recognize(transcription[0], return_single_label=True)

        label_true = df_resd_train.emotion.iloc[i]
        t1 = time.time() - t1
        rez.append([i, wav_name, label_true, s_em, t_em, t1])
    except Exception:
        # Rows that fail at any stage are skipped without a message.
        pass
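
The loop above drops failed rows silently, which makes it hard to tell how many of the 100 files actually reached the result table. A minimal sketch of the same three-stage flow that records failures instead is shown below. All names in it (`run_pipeline`, `fake_ser`, `fake_stt`, `fake_tsa`) are hypothetical stand-ins, not part of `aniemore` or `transformers`.

```python
import time


def run_pipeline(items, ser, stt, tsa):
    """Run SER, STT and TSA on each item; return (rows, errors)."""
    rows, errors = [], []
    for i, item in enumerate(items):
        start = time.time()
        try:
            audio_label = ser(item)      # emotion from audio
            text = stt(item)             # speech to text
            text_label = tsa(text)       # emotion from text
        except Exception as exc:
            # Keep the index and the reason so failures can be inspected later.
            errors.append((i, repr(exc)))
            continue
        rows.append([i, audio_label, text, text_label, time.time() - start])
    return rows, errors


# Stand-in stages for illustration only.
def fake_ser(item):
    return "happiness"


def fake_stt(item):
    if item is None:
        raise ValueError("empty audio")
    return item.upper()


def fake_tsa(text):
    return "neutral"


rows, errors = run_pipeline(["a", None, "b"], fake_ser, fake_stt, fake_tsa)
print(len(rows), errors)
```

Comparing `len(rows)` with the number of inputs shows at a glance how much of the data the reported accuracy really covers.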

In [21]:
df_aniemore_resd = pd.DataFrame(rez, columns=['N','file_name','label_true','label_audio','label_text', 't'])
df_aniemore_resd.to_csv('aniem_resd.csv')

### Model accuracy
    - by text

In [22]:
acc = np.mean(df_aniemore_resd.label_text == df_aniemore_resd.label_true)
acc

0.15384615384615385

    - by voice

In [23]:
acc = np.mean(df_aniemore_resd.label_audio == df_aniemore_resd.label_true)
acc

1.0

    - mean processing time (seconds per file)

In [24]:
df_aniemore_resd.t.mean()

1.2905627672488873