## Multimodal Speech Emotion Recognition

### Title & Author

Speech Emotion Recognition: Fusing Audio and Text for Multimodal Emotion Analysis

~ Prince Tholia

### Motivation

Emotion recognition from speech is a fascinating challenge that has received comparatively little attention, even though AI and machine learning have advanced so far. I chose this topic because it exemplifies the power of multimodal data analysis: leveraging both what is said (text) and how it is said (audio) to infer human emotion more accurately than either modality alone. This has real-world relevance in areas like virtual assistants, call centers, and voice transcription for people with hearing impairments.

### Connection to Multimodal Learning: A Brief Perspective

Multimodal learning brings together different types of information—like images, audio, and text—to help models better understand and interpret data. Thanks to recent progress in deep learning, we can now combine these modalities more effectively, which has led to major improvements in areas such as video captioning, visual question answering, and emotion recognition.

Speech Emotion Recognition (SER) is a great example of a multimodal task. By blending audio cues—like tone, pitch, and energy—with the actual words being spoken, we can make much more accurate predictions about someone’s emotional state. Modern approaches, such as ECAPA-TDNN for processing audio and transformer-based models for analyzing text, highlight how powerful specialized deep learning models can be. When used together with smart fusion techniques, they can significantly boost performance.

### My Learning Journey

#### Data and Preprocessing

- I explored multiple open-source datasets for emotion-labeled speech, including RAVDESS, CREMA-D, TESS, SAVEE, EMOVO, Urdu, and EmoDB. Each dataset provides audio samples annotated with emotions like happy, sad, angry, fear, disgust, surprise, and neutral.

- Preprocessing involved:

    - Audio: Loading files, normalizing, and visualizing waveforms and spectrograms.

    - Text: Using automatic speech recognition (ASR) tools (e.g., Vosk) to transcribe audio into text.
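
The audio normalization step can be sketched roughly as follows. The write-up does not say which normalization was applied, so this assumes simple peak normalization; the helper name `peak_normalize` is hypothetical, and a synthetic tone stands in for a loaded audio file.

```python
import numpy as np

def peak_normalize(y, eps=1e-9):
    """Scale a waveform so its largest absolute sample is 1.0 (hypothetical helper)."""
    y = np.asarray(y, dtype=np.float32)
    peak = np.max(np.abs(y)) if y.size else 0.0
    # Leave silent or empty clips untouched to avoid dividing by ~zero.
    if peak < eps:
        return y
    return y / peak

# Synthetic stand-in for a loaded clip: a quiet 440 Hz tone, 0.5 s at 16 kHz.
sr = 16000
t = np.arange(int(0.5 * sr)) / sr
quiet = (0.1 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)
normalized = peak_normalize(quiet)
print(float(np.max(np.abs(quiet))), float(np.max(np.abs(normalized))))
```

Normalizing every clip to the same peak keeps recordings from different datasets on a comparable amplitude scale before features are computed.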



#### Feature Extraction

- Audio: Extracted features such as MFCCs, zero-crossing rate, spectral centroid, RMS energy, and pitch. Data augmentation (noise, pitch shift, time stretch) was used to improve model robustness.

- Text: Used transformer-based models (DistilRoBERTa fine-tuned for emotion classification) to analyze sentiment and emotion from transcribed text.
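
To make two of the audio features and one augmentation concrete, here is a minimal NumPy sketch of zero-crossing rate, RMS energy, and noise injection. It is an illustration only, not the project's feature code: the function names, frame sizes, and the 20 dB signal-to-noise ratio are assumptions, and in practice librosa provides equivalent feature functions.

```python
import numpy as np

def frame_signal(y, frame_length=2048, hop_length=512):
    """Split a 1-D signal into overlapping frames (no padding)."""
    if len(y) < frame_length:
        raise ValueError("signal shorter than one frame")
    n_frames = 1 + (len(y) - frame_length) // hop_length
    idx = np.arange(frame_length)[None, :] + hop_length * np.arange(n_frames)[:, None]
    return y[idx]

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs in each frame whose sign differs."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def rms_energy(frames):
    """Root-mean-square amplitude of each frame."""
    return np.sqrt(np.mean(frames ** 2, axis=1))

def add_noise(y, snr_db=20.0, rng=None):
    """Augmentation: add white Gaussian noise at a target signal-to-noise ratio."""
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(y ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return y + rng.normal(0.0, np.sqrt(noise_power), size=y.shape)

# Synthetic stand-in for a speech clip: a 220 Hz tone, 1 s at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
y = 0.5 * np.sin(2 * np.pi * 220 * t)

frames = frame_signal(y)
zcr = zero_crossing_rate(frames)
rms = rms_energy(frames)
# Flatten per-frame features into one fixed-length vector for a 1D model.
features = np.concatenate([zcr, rms])
noisy = add_noise(y, snr_db=20.0, rng=np.random.default_rng(0))
print(frames.shape, features.shape)
```

As an aside on dimensions: a full 2.5 s clip at librosa's default 22,050 Hz sample rate and hop length of 512 yields 108 centered frames, and 108 * 22 = 2376, so the 2376-dimensional model input used in the demo code would be consistent with 22 values per frame (for example ZCR, RMS, and 20 MFCCs). The write-up does not confirm this layout.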

#### Model Architectures

- Audio Model: Implemented a deep 1D CNN (inspired by ECAPA-TDNN) in PyTorch. Trained on extracted features, it achieved high accuracy (~95%) on test data.

- Text Model: Used HuggingFace pipelines with pre-trained emotion classifiers for text sentiment analysis.

- Fusion: Combined predictions from both modalities using a weighted average, allowing for flexible emphasis on either audio or text.
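
One practical detail of the weighted-average fusion is that the two models may not name their classes the same way (datasets tend to use labels like "happy", while the text classifier reports labels like "joy"). The sketch below aligns labels before averaging. The mapping table and the helper names `align_labels` and `fuse` are assumptions made for illustration, not the project's code.

```python
# Hypothetical mapping from dataset-style audio labels to the text model's label names.
AUDIO_TO_TEXT_LABEL = {
    "angry": "anger", "disgust": "disgust", "fear": "fear", "happy": "joy",
    "neutral": "neutral", "sad": "sadness", "surprise": "surprise",
}

def align_labels(probs, mapping):
    """Rename keys via mapping; labels without an entry are kept unchanged."""
    aligned = {}
    for label, p in probs.items():
        key = mapping.get(label, label)
        aligned[key] = aligned.get(key, 0.0) + p
    return aligned

def fuse(text_probs, audio_probs, text_weight=0.6):
    """Weighted average over the union of labels, renormalized to sum to 1."""
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must be in [0, 1]")
    labels = sorted(set(text_probs) | set(audio_probs))
    fused = {
        label: text_weight * text_probs.get(label, 0.0)
        + (1.0 - text_weight) * audio_probs.get(label, 0.0)
        for label in labels
    }
    total = sum(fused.values())
    return {label: p / total for label, p in fused.items()} if total > 0 else fused

# Made-up probabilities, purely for demonstration.
text_probs = {"joy": 0.7, "neutral": 0.2, "sadness": 0.1}
audio_probs = align_labels({"happy": 0.4, "angry": 0.5, "neutral": 0.1}, AUDIO_TO_TEXT_LABEL)
fused = fuse(text_probs, audio_probs, text_weight=0.6)
top = max(fused, key=fused.get)
print(top, round(fused[top], 3))
```

Setting `text_weight` closer to 1.0 emphasizes the transcript, while values near 0.0 emphasize tone of voice.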



#### Experimentation & Visualization

Demonstrated the pipeline on sample audio files, showing:

- Audio waveform and spectrogram.

- Transcribed text.

- Emotion probabilities from both modalities.

- Fused emotion prediction and confidence.

- Enhanced output with appropriate emoji for the detected emotion.
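
The final display step (top emotion, confidence, and emoji) could look something like the sketch below. The emoji table is a hypothetical choice, since the write-up does not list the actual mapping, and `describe` is an illustrative helper name.

```python
# Hypothetical emoji lookup; the actual mapping used in the project is not shown.
EMOJI = {
    "anger": "😠", "disgust": "🤢", "fear": "😨", "joy": "😄",
    "neutral": "😐", "sadness": "😢", "surprise": "😲",
}

def describe(fused_probs, default="❓"):
    """Format the most likely emotion as 'emoji label (confidence%)'."""
    if not fused_probs:
        raise ValueError("fused_probs is empty")
    label = max(fused_probs, key=fused_probs.get)
    confidence = fused_probs[label]
    return f"{EMOJI.get(label, default)} {label} ({confidence:.0%})"

# Made-up fused probabilities, purely for demonstration.
print(describe({"joy": 0.58, "anger": 0.20, "neutral": 0.16, "sadness": 0.06}))
```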



<img src = "output.png">

#### Code & Demos

In [None]:
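# Audio preprocessing and feature extraction

import librosa
import librosa.display  # needed for waveshow/specshow
import matplotlib.pyplot as plt
import numpy as np

# Load 2.5 s of audio, skipping the first 0.6 s
y, sr = librosa.load('sample.wav', duration=2.5, offset=0.6)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)

# Plot waveform and spectrogram

fig, (ax_wave, ax_spec) = plt.subplots(2, 1, figsize=(14, 8))
librosa.display.waveshow(y, sr=sr, ax=ax_wave)
ax_wave.set_title('Waveform')
spec_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
librosa.display.specshow(spec_db, sr=sr, x_axis='time', y_axis='hz', ax=ax_spec)
ax_spec.set_title('Spectrogram')
plt.tight_layout()
plt.show()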


In [None]:
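# Transcribe audio to text using Vosk

import json
import wave

from vosk import Model, KaldiRecognizer

# Vosk expects 16-bit mono PCM WAV input
wf = wave.open('sample.wav', "rb")
model = Model('vosk-model-small-en-us-0.15')
rec = KaldiRecognizer(model, wf.getframerate())
parts = []
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        parts.append(json.loads(rec.Result()).get('text', ''))
# Flush the audio still buffered after the last chunk
parts.append(json.loads(rec.FinalResult()).get('text', ''))
wf.close()
text = ' '.join(p for p in parts if p)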


In [None]:
# Text emotion analysis

from transformers import pipeline
emotion_classifier = pipeline("text-classification", model="j-hartmann/emotion-english-distilroberta-base", top_k=None)
emotion_scores = emotion_classifier(text)[0]


In [None]:
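# Audio emotion analysis (deep model)
# AudioModel and extract_features come from the project code (not shown here).

import torch
audio_model = AudioModel(input_shape=(2376,), num_classes=7)
audio_model.load_state_dict(torch.load('audio_model.pth', map_location='cpu'))
audio_model.eval()  # inference mode: disables dropout, uses stored batch-norm statistics
features = extract_features(y, sr)
# Add batch and channel dimensions: (n_features,) -> (1, 1, n_features)
features_tensor = torch.tensor(features, dtype=torch.float32).unsqueeze(0).unsqueeze(0)
with torch.no_grad():
    outputs = audio_model(features_tensor)
probabilities = torch.softmax(outputs, dim=1).cpu().numpy()[0]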


In [None]:
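# Fusion and visualization

# Assumption: the audio model's class order matches this list; adjust it to the order used in training.
audio_labels = ['anger', 'disgust', 'fear', 'joy', 'neutral', 'sadness', 'surprise']
audio_probs = dict(zip(audio_labels, probabilities))
text_probs = {s['label']: s['score'] for s in emotion_scores}
all_emotions = sorted(set(text_probs) | set(audio_probs))
# Weighted average: 60% text, 40% audio
fused_emotions = {emo: 0.6 * text_probs.get(emo, 0) + 0.4 * audio_probs.get(emo, 0) for emo in all_emotions}
plt.bar(list(fused_emotions.keys()), list(fused_emotions.values()))
plt.title('Fused Emotion Probabilities')
plt.show()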


### Reflections

#### What surprised me?

- The accuracy boost from combining modalities was significant, especially in ambiguous cases where either the text or the tone alone was insufficient.

- Everyone attaches their own context to each emoji, which makes it difficult to map emotions to emojis in a way that reads correctly for a large audience.

#### Scope for Improvement



- Integrate end-to-end multimodal models (e.g., transformers that jointly process audio and text).

- Explore more sophisticated fusion techniques, such as attention-based weighting or late fusion with confidence calibration.

- Use larger, more diverse datasets and experiment with cross-lingual emotion recognition.
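
As one possible direction for smarter fusion, the fixed 0.6/0.4 split could be replaced by weights that depend on how confident each modality is for a given clip. The sketch below uses normalized entropy as a simple confidence proxy. It is neither attention-based weighting nor true confidence calibration, only a small stand-in for the idea, and every name in it is hypothetical. It assumes both distributions sum to 1.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a {label: probability} distribution."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)

def confidence(probs):
    """Roughly 1.0 for a one-hot distribution and 0.0 for a uniform one."""
    n = len(probs)
    if n < 2:
        return 1.0
    return 1.0 - entropy(probs) / math.log(n)

def adaptive_fuse(text_probs, audio_probs, eps=1e-6):
    """Weight each modality by its confidence instead of a fixed split."""
    wt = confidence(text_probs) + eps
    wa = confidence(audio_probs) + eps
    wt, wa = wt / (wt + wa), wa / (wt + wa)
    labels = set(text_probs) | set(audio_probs)
    fused = {
        label: wt * text_probs.get(label, 0.0) + wa * audio_probs.get(label, 0.0)
        for label in labels
    }
    return fused, wt

# Made-up example: the transcript is uninformative, the tone is clearly angry.
text_probs = {"joy": 1 / 3, "anger": 1 / 3, "neutral": 1 / 3}
audio_probs = {"joy": 0.05, "anger": 0.90, "neutral": 0.05}
fused, text_weight = adaptive_fuse(text_probs, audio_probs)
top = max(fused, key=fused.get)
print(top, round(text_weight, 4))
```

With an uninformative transcript, almost all of the weight shifts to the audio prediction, which is the behavior a fixed split cannot provide.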

### References

- RAVDESS Dataset

- CREMA-D Dataset

- Vosk Speech Recognition Toolkit

- ECAPA-TDNN Paper

- DistilRoBERTa Emotion Model