Emotion recognition is a powerful tool that can be used in a variety of applications, such as improving customer service, personalizing user experiences, and helping people with speech disorders communicate more effectively.
This project aims to develop a solution that accurately detects and analyzes the emotional state of call center employees during customer interactions, leveraging transformer models such as Wav2Vec2.0, HuBERT, and WavLM.
Two datasets are used in this project:
- Dusha is a bi-modal corpus suitable for speech emotion recognition (SER) tasks. The dataset consists of about 300,000 audio recordings of Russian speech with their transcripts and emotional labels, totaling approximately 350 hours of data. Four basic emotions that typically appear in a dialog with a virtual assistant were selected: Happiness (Positive), Sadness, Anger, and Neutral.
NB: Only a small subset of the Dusha dataset was used in this project.
- EmoCall is a dataset of 329 telephone recordings of Russian speech from 10 actors. Each actor spoke from a selection of 10 sentences per emotion, presented using one of five emotions (Anger, Positive, Neutral, Sad, and Other).
The files contain speech sampled at 16 kHz and saved as 16-bit PCM WAV files.
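For context, here is a minimal sketch of loading an audio file and converting it to the 16 kHz mono format used by both datasets. It assumes torchaudio, which may differ from the loading code actually used in this repository, and the file path is illustrative.

```python
# Minimal sketch: load an audio file and convert it to 16 kHz mono,
# matching the format of the Dusha and EmoCall recordings.
# torchaudio is an assumption here; the repository may use a different loader.
import torchaudio

def load_audio_16k(path: str):
    waveform, sample_rate = torchaudio.load(path)        # (channels, samples)
    if waveform.shape[0] > 1:                            # downmix stereo to mono
        waveform = waveform.mean(dim=0, keepdim=True)
    if sample_rate != 16000:                             # resample if needed
        waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)
    return waveform.squeeze(0), 16000

audio, sr = load_audio_16k("example.wav")                # path is illustrative
```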
All checkpoints can be found below:
| Models | Pretrained Checkpoints |
|---|---|
| Wav2Vec2.0 | jonatasgrosman/wav2vec2-large-xlsr-53-russian |
| HuBERT | facebook/hubert-large-ls960-ft |
| WavLM | microsoft/wavlm-large |
| Models (Group_1: fine-tuned on Dusha) | Checkpoints |
|---|---|
| Wav2Vec2.0 | dusha/wav2vec2/audio-model |
| HuBERT | dusha/hubert/audio-model |
| WavLM | dusha/wavlm/audio-model |
| Models (Group_2: fine-tuned on EmoCall) | Checkpoints |
|---|---|
| Wav2Vec2.0 | emocall/wav2vec/audio-model |
| HuBERT | emocall/hubert/audio-model |
| WavLM | emocall/wavlm/audio-model |
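As a hedged sketch, any of the checkpoints above can be loaded through the Hugging Face transformers auto classes; the Group_1/Group_2 paths are local directories produced by the training scripts, not Hub model IDs.

```python
# Sketch of loading a checkpoint from the tables above with Hugging Face transformers.
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

checkpoint = "microsoft/wavlm-large"  # or a local fine-tuned path, e.g. "emocall/wavlm/audio-model"
feature_extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = AutoModelForAudioClassification.from_pretrained(checkpoint)
```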
In the scripts folder you can find training and evaluation scripts for Wav2Vec2.0, HuBERT, and WavLM on the Dusha and EmoCall datasets.
| Models | Accuracy on EmoCall (Group_1) | Accuracy on EmoCall (Group_2) |
|---|---|---|
| Wav2Vec2.0 | 0.88 | 0.98 |
| HuBERT | 0.73 | 0.98 |
| WavLM | 0.93 | 0.99 |
Based on the evaluation results on the EmoCall dataset, the WavLM model (Group_2) was chosen for the prototype application, as it recognizes emotions in speech with 99% accuracy.
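A minimal inference sketch with the selected WavLM (Group_2) checkpoint is shown below. The emotion names are read from the checkpoint's id2label config, which the training scripts are assumed to have set to the five EmoCall labels; the example file name is illustrative.

```python
# Sketch: classify the emotion of a single recording with the WavLM (Group_2) checkpoint.
# The checkpoint path and the example file name are illustrative.
import torch
import torchaudio
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

checkpoint = "emocall/wavlm/audio-model"
feature_extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = AutoModelForAudioClassification.from_pretrained(checkpoint).eval()

waveform, sr = torchaudio.load("call_fragment.wav")
waveform = waveform.mean(dim=0)                              # downmix to mono
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)

# Map class indices back to emotion names via the model config.
print({model.config.id2label[i]: round(float(p), 3) for i, p in enumerate(probs)})
```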
To install the necessary dependencies, first clone the repository:
git clone https://github.com/AlinaShapiro/Audio-Classification-HF.git
Then install the requirements:
pip install -r requirements.txt
To run the app, use the following command:
python app.py

Select a video (.mp4) or audio (.wav, .mp3, .m4a) file to analyze by clicking the "Выбрать файл" (Select file) button.

Click the playback button to play the selected video or audio file.

Click the "Анализировать эмоции" (Analyze emotions) button to analyze the file for the presence of the five emotions (Anger, Positive, Neutral, Sad, and Other) using the WavLM model.

Once the analysis is complete, a graphical report of the emotional state is displayed.
In addition, you can download the emotional state report as a .json file by clicking the "Выгрузить отчет" (Export report) button.
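The exact schema of the exported report is defined by app.py; as a purely hypothetical illustration, it could contain per-emotion scores along these lines (all field names and values below are made up for the example).

```python
# Hypothetical illustration only: serialize per-emotion scores to a JSON report.
# The real report fields produced by app.py may differ.
import json

report = {
    "file": "call_fragment.wav",
    "scores": {"Anger": 0.02, "Positive": 0.91, "Neutral": 0.05, "Sad": 0.01, "Other": 0.01},
    "predicted_emotion": "Positive",
}
with open("emotion_report.json", "w", encoding="utf-8") as f:
    json.dump(report, f, ensure_ascii=False, indent=2)
```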

Emotion recognition in speech is a challenging but important task with many practical applications. By using an application that accurately identifies the emotional state of call-center employees, call-center management can take steps to improve employees' emotional well-being and productivity.
