# Emotion Recognition Model Comparison

This notebook tests the emotion models used in the VocalMind pipeline:
- **Text**: `j-hartmann/emotion-english-distilroberta-base` (7 classes)
- **Audio**: `audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim` (dimensional → mapped to 7 classes)

## Setup

In [25]:
import os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
import warnings
warnings.filterwarnings("ignore")

import torch
import numpy as np
import librosa
from transformers import pipeline, Wav2Vec2Processor

device = 0 if torch.cuda.is_available() else -1
print(f"Device: {'CUDA' if device == 0 else 'CPU'}")

Device: CUDA


## 1. Text Emotion Model: `j-hartmann/emotion-english-distilroberta-base`

**Classes (7):** anger, disgust, fear, joy, neutral, sadness, surprise

In [26]:
text_emotion_model = "j-hartmann/emotion-english-distilroberta-base"
text_classifier = pipeline("text-classification", model=text_emotion_model, device=device, top_k=None)
print(f"Loaded: {text_emotion_model}")

Loaded: j-hartmann/emotion-english-distilroberta-base


In [27]:
# Test sentences
test_sentences = [
    "I'm so happy today, everything is going great!",
    "I've been charged twice. This is unacceptable.",
    "I'm so sorry about that. Let me look into it.",
    "Really? A credit too? That's amazing!",
    "The meeting is scheduled for 3 PM tomorrow."
]

print("Text Emotion Classification Results:\n")
for sentence in test_sentences:
    result = text_classifier(sentence)[0]
    top = max(result, key=lambda x: x['score'])
    print(f"'{sentence[:50]}...' -> {top['label'].upper()} ({top['score']:.1%})")

Text Emotion Classification Results:

'I'm so happy today, everything is going great!...' -> JOY (97.2%)
'I've been charged twice. This is unacceptable....' -> ANGER (82.3%)
'I'm so sorry about that. Let me look into it....' -> SADNESS (90.8%)
'Really? A credit too? That's amazing!...' -> SURPRISE (95.3%)
'The meeting is scheduled for 3 PM tomorrow....' -> NEUTRAL (70.4%)


## 2. Audio Emotion Model: `audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim`

**Output**: Dimensional values (arousal, dominance, valence) on 0-1 scale

Trained on MSP-Podcast dataset (natural conversational speech).

In [28]:
import torch.nn as nn
from transformers import Wav2Vec2Model, Wav2Vec2PreTrainedModel

class EmotionModel(Wav2Vec2PreTrainedModel):
    """Custom model wrapper for audeering's dimensional emotion model."""
    def __init__(self, config):
        super().__init__(config)
        self.wav2vec2 = Wav2Vec2Model(config)
        self.classifier = nn.Linear(config.hidden_size, 3)  # arousal, dominance, valence
        self.init_weights()
    
    def forward(self, input_values):
        outputs = self.wav2vec2(input_values)
        hidden_states = outputs.last_hidden_state
        pooled = hidden_states.mean(dim=1)
        logits = self.classifier(pooled)
        return hidden_states, logits

audio_model_name = "audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim"
processor = Wav2Vec2Processor.from_pretrained(audio_model_name)
audio_model = EmotionModel.from_pretrained(audio_model_name)
audio_model.eval()
if torch.cuda.is_available():
    audio_model = audio_model.cuda()
print(f"Loaded: {audio_model_name}")

Some weights of EmotionModel were not initialized from the model checkpoint at audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim and are newly initialized: ['classifier.bias', 'classifier.weight', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loaded: audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim


In [29]:
def dimensions_to_emotion(arousal, valence, dominance):
    """Map dimensional values to emotion category."""
    if arousal > 0.7:
        if valence > 0.5:
            return "joy"
        else:
            return "anger" if dominance > 0.5 else "fear"
    if valence > 0.6:
        return "joy"
    if valence < 0.4:
        if arousal > 0.6:
            return "anger" if dominance > 0.5 else "fear"
        elif arousal < 0.4:
            return "sadness"
        else:
            return "disgust"
    return "neutral"

def predict_audio_emotion(audio_path):
    """Predict emotion from audio file."""
    audio, sr = librosa.load(audio_path, sr=16000)
    inputs = processor(audio, sampling_rate=16000)
    input_values = torch.tensor(inputs.input_values).reshape(1, -1)
    if torch.cuda.is_available():
        input_values = input_values.cuda()
    
    with torch.no_grad():
        _, logits = audio_model(input_values)
        dims = logits[0].cpu().numpy()
    
    arousal, dominance, valence = float(dims[0]), float(dims[1]), float(dims[2])
    emotion = dimensions_to_emotion(arousal, valence, dominance)
    
    return {
        'emotion': emotion,
        'arousal': arousal,
        'dominance': dominance,
        'valence': valence
    }

## 3. Test on Audio Sample

In [31]:
# Test with sample audio (update path as needed)
audio_path = "../Voice-Generation/generated_audio/hard_overlap.mp3"

if os.path.exists(audio_path):
    result = predict_audio_emotion(audio_path)
    print(f"Audio Emotion: {result['emotion'].upper()}")
    print(f"Dimensions: Arousal={result['arousal']:.2f}, Valence={result['valence']:.2f}, Dominance={result['dominance']:.2f}")
else:
    print(f"Audio file not found: {audio_path}")

Audio Emotion: SADNESS
Dimensions: Arousal=0.01, Valence=-0.01, Dominance=0.05


## Summary

| Model | Task | Output |
|-------|------|--------|
| `j-hartmann/emotion-english-distilroberta-base` | Text | 7 emotion classes |
| `audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim` | Audio | Valence/Arousal/Dominance → mapped to emotions |