# **Speech to Emotion Recognition**
James Knee, Tyler Nguyen, Varsha Singh, Anish Sinha, Nathan Strahs


---

### Task ###
Our task is to use a deep learning architecture to identify the underlying emotion in English speech audio, a problem formally known as Speech Emotion Recognition (SER). Identifying emotions from speech is hard even for people, and it requires careful analysis over time. Emotional expression is also subjective: different speakers articulate emotions differently, with variations in pitch, intensity, rhythm, and cadence. Raw audio signals are complex as well, so the data will require substantial preprocessing. In the end, we would like our model to differentiate between anger, disgust, fear, happiness, sadness, and neutrality.


---

# Architecture Overview

1. **Preprocessing**
      - Normalize audio volume
      - Convert audio to a time-frequency representation such as a spectrogram
2. **Feature Extraction via ResNet**
      - Feed the spectrogram into a residual network
      - Retain extracted features by removing final classification layer in ResNet
3. **Temporal Modeling via Transformer Encoder**
      - Pass ResNet output to transformer and capture long-range dependencies and sequential relationships in the audio
4. **Classification Layer**
      - Apply a softmax layer to classify the output into one of six emotion categories: anger, disgust, fear, happiness, sadness, neutrality.
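
The final classification step can be illustrated without any deep learning library. The following is a minimal sketch, assuming the model emits one raw score (logit) per emotion; the class order and the example logits are hypothetical and chosen only for illustration.

```python
import math

# Class order assumed here for illustration only.
EMOTIONS = ["anger", "disgust", "fear", "happiness", "neutrality", "sadness"]

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_emotion(logits):
    """Return the most probable emotion and the full probability list."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return EMOTIONS[best], probs

# Hypothetical logits, not taken from any trained model.
label, probs = predict_emotion([0.2, -1.0, 0.1, 2.3, 0.0, 0.4])
print(label)  # happiness
```

Softmax does not change which class has the highest score, so taking the argmax of the raw logits gives the same prediction.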

Alternative Model: State Space Model (SSM) such as Mamba

[insert diagram here]

# Datasets

Below are the datasets we will use for our Speech Emotion Recognition project:

- **CREMA-D (Crowd-Sourced Emotional Multimodal Actors Dataset)**
  - **Description**: An audio-visual dataset comprising 7,442 clips from 91 actors (48 male, 43 female) aged between 20 and 74, representing diverse ethnic backgrounds. Actors vocalized 12 sentences expressing six emotions: anger, disgust, fear, happiness, neutral, and sadness. Each clip has multiple ratings for audio-only, visual-only, and audio-visual presentations.
  - **Link**: https://www.kaggle.com/datasets/ejlok1/cremad

- **RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song)**
  - **Description**: Comprises 7,356 files from 24 professional actors (12 male, 12 female) speaking two lexically-matched statements in a neutral North American accent. Speech includes eight emotions: neutral, calm, happy, sad, angry, fearful, surprise, and disgust, each at two intensity levels. Available in audio-only, video-only, and audio-visual formats.
  - **Link**: https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio

- **Berlin Emotional Database**
  - **Description**: Contains 535 utterances from ten actors (five male, five female) expressing seven emotions: anger, boredom, disgust, fear, happiness, sadness, and neutral. Recorded at 48kHz and downsampled to 16kHz.
  - **Link**: http://emodb.bilderbar.info/
  - **Kaggle Link**: https://www.kaggle.com/datasets/piyushagni5/berlin-database-of-emotional-speech-emodb
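
Because the datasets label emotions differently, combining them requires mapping each one onto the six target classes. The sketch below is one possible approach, not the project's implementation. It assumes CREMA-D stores the emotion code in the third underscore-separated field of the file name and RAVDESS in the third hyphen-separated field. Dropping RAVDESS clips labeled calm or surprise is also an assumption. The function name `label_from_filename` is hypothetical.

```python
# Six target classes shared by the project.
TARGET = {"anger", "disgust", "fear", "happiness", "sadness", "neutral"}

# CREMA-D: third underscore-separated field of the file name.
CREMA_D = {"ANG": "anger", "DIS": "disgust", "FEA": "fear",
           "HAP": "happiness", "NEU": "neutral", "SAD": "sadness"}

# RAVDESS (assumed convention): third hyphen-separated field of the file name.
RAVDESS = {"01": "neutral", "02": "calm", "03": "happiness", "04": "sadness",
           "05": "anger", "06": "fear", "07": "disgust", "08": "surprise"}

def label_from_filename(name):
    """Return one of the six target labels, or None if the clip should be skipped."""
    stem = name.rsplit(".", 1)[0]
    if "_" in stem:
        parts = stem.split("_")
        label = CREMA_D.get(parts[2]) if len(parts) > 2 else None
    else:
        parts = stem.split("-")
        label = RAVDESS.get(parts[2]) if len(parts) > 2 else None
    return label if label in TARGET else None

print(label_from_filename("1001_DFA_ANG_XX.wav"))       # anger
print(label_from_filename("03-01-06-01-02-01-12.wav"))  # fear
print(label_from_filename("03-01-02-01-01-01-01.wav"))  # None (calm is skipped)
```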


# Preprocessing Data

In [35]:
#imports
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchaudio
from torchvision import datasets
import torchvision.transforms as transforms
from torch.utils.data import Dataset, DataLoader
import os

import random
import torchaudio.transforms as T

In [36]:
#necessary variables
DATA_PATH="AudioWAV/"

training_split=0.8
testing_split=0.2
batch_size = 32

In [37]:
class AudioDataset(Dataset):
    def __init__(self, data_dir, transform=False, target_length=160):
        self.data_dir = data_dir
        self.transform = transform
        self.target_length = target_length

        self.emotion_map = {
            "ANG": 0, "DIS": 1, "FEA": 2,
            "HAP": 3, "NEU": 4, "SAD": 5
        }

        # Filter only valid files with known emotion labels
        self.audio_files = [
            f for f in os.listdir(data_dir)
            if f.endswith('.wav') and f.split('_')[2] in self.emotion_map
        ]

        # Extract labels
        self.strlabels = [f.split('_')[2] for f in self.audio_files]
        self.labels = [self.emotion_map[label] for label in self.strlabels]

        # Fixed transforms
        self.sample_rate = 16000
        self.mel_transform = T.MelSpectrogram(
            sample_rate=self.sample_rate,
            n_fft=2048,
            hop_length=512,
            n_mels=128
        )
        self.db_transform = T.AmplitudeToDB()

        # Resampling is handled per file in __getitem__, since source sample rates may differ

    def __len__(self):
        return len(self.audio_files)

    def __getitem__(self, idx):
        file_path = os.path.join(self.data_dir, self.audio_files[idx])
        waveform, sample_rate = torchaudio.load(file_path)

        # Resample to 16kHz if needed
        if sample_rate != self.sample_rate:
            resample = T.Resample(orig_freq=sample_rate, new_freq=self.sample_rate)
            waveform = resample(waveform)

        # Convert stereo to mono
        if waveform.shape[0] > 1:
            waveform = waveform.mean(dim=0, keepdim=True)

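        # Remove DC offset (center the waveform at zero)
        waveform = waveform - waveform.mean()

        # Volume augmentation on waveform: random amplitude gain in [0.5, 1.5]
        # (T.Vol expects a single float gain, not a range)
        if self.transform and random.random() < 0.5:
            gain = random.uniform(0.5, 1.5)
            waveform = T.Vol(gain=gain, gain_type="amplitude")(waveform)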

        # Compute Mel spectrogram and convert to dB
        mel_spec = self.mel_transform(waveform)
        mel_spec = self.db_transform(mel_spec)

        # MinMax normalization to [0, 1]
        mel_min = mel_spec.min()
        mel_max = mel_spec.max()
        mel_spec = (mel_spec - mel_min) / (mel_max - mel_min + 1e-6)

        # Spectrogram-level augmentation
        if self.transform:
            if random.random() < 0.5:
                mel_spec = T.FrequencyMasking(freq_mask_param=15)(mel_spec)
            if random.random() < 0.5:
                mel_spec = T.TimeMasking(time_mask_param=35)(mel_spec)

        # Fix time dimension by padding or cropping
        current_length = mel_spec.shape[-1]
        if current_length < self.target_length:
            pad_amount = self.target_length - current_length
            mel_spec = F.pad(mel_spec, (0, pad_amount))
        else:
            mel_spec = mel_spec[:, :, :self.target_length]

        label = torch.tensor(self.labels[idx], dtype=torch.long)

        # Remove channel dimension if needed (1, 128, T) -> (128, T)
        mel_spec = mel_spec.squeeze(0)

        return mel_spec, label
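
The relationship between clip duration and the spectrogram's time dimension is simple arithmetic. The sketch below assumes torchaudio's default centered framing (`center=True`), where the frame count is one more than the number of whole hops. The helper names are hypothetical.

```python
SAMPLE_RATE = 16000   # Hz, as in AudioDataset
HOP_LENGTH = 512      # samples between frames, as in AudioDataset
TARGET_LENGTH = 160   # frames, the default target_length

def num_frames(num_samples, hop_length=HOP_LENGTH):
    """Frame count for a centered STFT: one frame per whole hop, plus one."""
    return num_samples // hop_length + 1

def frames_to_seconds(frames, sample_rate=SAMPLE_RATE, hop_length=HOP_LENGTH):
    """Approximate duration covered by a number of frames."""
    return frames * hop_length / sample_rate

print(num_frames(3 * SAMPLE_RATE))       # 94 frames for a 3-second clip
print(frames_to_seconds(TARGET_LENGTH))  # 5.12 seconds kept by target_length
```

Under these settings, clips shorter than about 5.12 seconds are zero-padded on the right and longer clips are cropped.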


In [38]:
#this function pads per batch so that every spectrogram in a batch has the same time dimension

def collate_fn(batch):
    spectrograms, labels = zip(*batch)
    
    max_length = max(spec.shape[1] for spec in spectrograms)

    #pad spectrograms to match longest
    spectrograms_padded = [torch.nn.functional.pad(spec, (0, max_length - spec.shape[1])) for spec in spectrograms]

    # Convert list to tensor
    spectrograms_padded = torch.stack(spectrograms_padded)

    labels = torch.tensor(labels, dtype=torch.long)
    return spectrograms_padded, labels

In [39]:
#declaring dataset
dataset = AudioDataset(DATA_PATH)

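#calculate training size and testing size
train_size = int(len(dataset) * training_split)
test_size = len(dataset) - train_size

train_set, test_set = torch.utils.data.random_split(dataset, [train_size, test_size])

#NOTE: both subsets wrap the same `dataset` object, so the two assignments below
#write the same attribute and the last one wins: augmentation is off for BOTH splits.
#Training-only augmentation would need a separate AudioDataset(transform=True) instance.
train_set.dataset.transform = True
test_set.dataset.transform = False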

#dataloaders
train_loader = DataLoader(train_set, batch_size=batch_size, collate_fn=collate_fn, shuffle=True)
test_loader = DataLoader(test_set, batch_size=batch_size, collate_fn=collate_fn, shuffle=False)

#FINAL DIMENSIONS OF SPECS: BatchSize x 128 x MaxTimeLength

# Declaring Models

In [68]:
#all models should accept inputs of different lengths (shouldn't have to worry about mamba)
#we should look into using global adaptive pooling

'''
TODO: 

MODELS THAT WE NEED TO MAKE:
CNN-Transformer: Should we use a resnet on this? Would that be overkill? We could use a resnet
    and train it ourselves (not sure if a pretrained resnet would be great)
    
Regular CNN: this will be our base model for comparison. We should play around with this, and
    this should be the same kind of CNN that we use in our other models (i.e. resnet?)
    
Mamba Model: we should train a basic mamba model

Mamba-CNN: we should incorporate a cnn with a mamba model

Pretrained SOTA model: we should declare a pretrained state of the art model and compare against that
'''
        
class Base_CNN(nn.Module):
    def __init__(self):
        super(Base_CNN, self).__init__()
        
        #declare layers
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(64)

        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(128)
        
        self.conv4 = nn.Conv2d(128, 256, kernel_size=3, padding=1)
        self.bn4 = nn.BatchNorm2d(256)

        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.relu = nn.ReLU()
        
        self.global_pool = nn.AdaptiveAvgPool2d((1,1))
        self.max_pool = nn.AdaptiveMaxPool2d((1,1))
        
        self.classifier1 = nn.Linear(256, 64)
        self.classifier2 = nn.Linear(64, 6)
        
        self.residualConv = nn.Conv2d(1, 64, kernel_size=1, stride=2, padding=1)
        
    def forward(self, x):
        residual = x

        x = self.conv1(x)
        x = self.relu(x)
        x = self.pool(x)
        x = self.bn1(x)

        x = self.conv2(x)
        
        residual = self.residualConv(residual)
        
        x = F.pad(x, (0, 1))
        
        residual = residual[:, :, :x.shape[2], :x.shape[3]]
        
        # print(f"x shape: {x.shape}, res shape: {residual.shape}")
        
        x = x + residual
        
        x = self.relu(x)
        x = self.pool(x)
        x = self.bn2(x)

        x = self.conv3(x)
        x = self.relu(x)
        x = self.pool(x)
        x = self.bn3(x)
        
        x = self.conv4(x)
        x = self.relu(x)
        x = self.pool(x)
        x = self.bn4(x)
        
        #to fix dimensionality
        x = self.max_pool(x)
        x = torch.flatten(x, 1)
        
        x = self.classifier1(x)
        x = self.classifier2(x)

        return x

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * -(torch.log(torch.tensor(10000.0)) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:x.size(0)]
        return x


class Base_CNN_Transformer(nn.Module):
    def __init__(self, transformer_layers=2, n_heads=4, transformer_dim=256, input_freq_bins=8):
        super(Base_CNN_Transformer, self).__init__()

        # CNN layers
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)

        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(64)

        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(128)

        self.conv4 = nn.Conv2d(128, transformer_dim, kernel_size=3, padding=1)
        self.bn4 = nn.BatchNorm2d(transformer_dim)

        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.relu = nn.ReLU()

        # Projection from D*F to D
        self.project = nn.Linear(transformer_dim * input_freq_bins, transformer_dim)

        # Transformer
        encoder_layer = nn.TransformerEncoderLayer(d_model=transformer_dim, nhead=n_heads, dropout=0.2)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=transformer_layers)
        self.pos_encoder = PositionalEncoding(transformer_dim)

        # Classification
        self.global_pool = nn.AdaptiveAvgPool1d(1)
        self.classifier = nn.Sequential(
            nn.Linear(transformer_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 6)
        )

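    def apply_layernorm(self, x):
        # Parameter-free layer norm over (C, H, W). Equivalent to building a fresh
        # nn.LayerNorm on every call, whose weight and bias would never be trained.
        return F.layer_norm(x, x.shape[1:])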

    def forward(self, x):
        # CNN
        x = self.relu(self.apply_layernorm(self.pool(self.conv1(x))))
        x = self.relu(self.apply_layernorm(self.pool(self.conv2(x))))
        x = self.relu(self.apply_layernorm(self.pool(self.conv3(x))))
        x = self.relu(self.apply_layernorm(self.pool(self.conv4(x))))  # [B, D, F, T]

        B, D, n_freq, T = x.shape  # named n_freq to avoid shadowing the module alias F

        # Rearrange for transformer: each time step is a token
        x = x.permute(0, 3, 1, 2)         # [B, T, D, n_freq]
        x = x.reshape(B, T, D * n_freq)   # [B, T, D*n_freq]
        x = self.project(x)               # [B, T, D]

        # Transformer expects [T, B, D]
        x = x.permute(1, 0, 2)            # [T, B, D]
        x = self.pos_encoder(x)
        x = self.transformer_encoder(x)

        # Back to [B, D, T] for pooling
        x = x.permute(1, 2, 0)            # [B, D, T]
        x = self.global_pool(x).squeeze(2)  # [B, D]

        x = self.classifier(x)  # [B, 6]
        return x

class Base_CNN_GRU(nn.Module):
    def __init__(self):
        super(Base_CNN_GRU, self).__init__()

        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)

        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(64)

        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(128)

        self.conv4 = nn.Conv2d(128, 256, kernel_size=3, padding=1)
        self.bn4 = nn.BatchNorm2d(256)

        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.relu = nn.ReLU()

        self.residualConv = nn.Conv2d(1, 64, kernel_size=1, stride=2, padding=1)

        # GRU input_size = 256 channels * 8 frequency bins (128 mel bins after four 2x2 pools);
        # sequence length = width (time frames)
        self.gru = nn.GRU(input_size=256*8, hidden_size=128, num_layers=1,
                          batch_first=True, bidirectional=True)

        self.classifier1 = nn.Linear(128 * 2, 64)  # bidirectional
        self.classifier2 = nn.Linear(64, 6)

    def forward(self, x):
        residual = x  # x: [B, 1, 128, 256]

        x = self.conv1(x)
        x = self.relu(x)
        x = self.pool(x)
        x = self.bn1(x)

        x = self.conv2(x)
        residual = self.residualConv(residual)
        x = F.pad(x, (0, 1))  # pad width to align
        residual = residual[:, :, :x.shape[2], :x.shape[3]]
        x = x + residual

        x = self.relu(x)
        x = self.pool(x)
        x = self.bn2(x)

        x = self.conv3(x)
        x = self.relu(x)
        x = self.pool(x)
        x = self.bn3(x)

        x = self.conv4(x)
        x = self.relu(x)
        x = self.pool(x)
        x = self.bn4(x)

        # x: [B, 256, H, W] — after all CNN and pooling layers
        B, C, H, W = x.shape
        
        # Reshape for GRU: treat W as time steps, and C*H as input features
        x = x.permute(0, 3, 1, 2)  # [B, W, C, H]
        x = x.contiguous().view(B, W, C * H)  # [B, W, C*H]

        # The GRU's fixed input_size (256*8) must equal C*H, which holds for 128 mel bins
        x, _ = self.gru(x)

        x = x[:, -1, :]  # last time step
        x = self.classifier1(x)
        x = self.classifier2(x)

        return x
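
The `PositionalEncoding` module defined above builds its table with tensor operations. As a cross-check of the formula, here is a dependency-free sketch that computes the same sinusoidal values entry by entry. The function name `positional_encoding` is hypothetical.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # Even index i uses sin, the following odd index uses cos,
            # both with wavelength scaling 10000^(i / d_model).
            angle = pos / (10000 ** (i / d_model))
            table[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                table[pos][i + 1] = math.cos(angle)
    return table

pe = positional_encoding(max_len=4, d_model=6)
print(pe[0])  # position 0: sin(0) = 0 and cos(0) = 1 alternate
```

Each position receives a distinct pattern, which is what lets the transformer distinguish the order of time steps after the CNN features are flattened into tokens.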

# Test Network

In [69]:
dummy_model = Base_CNN_GRU()

dummy_input = torch.randn(1, 128, 256).unsqueeze(1)

output = dummy_model(dummy_input)

print("Input shape: ", dummy_input.shape)
print("Output shape: ", output.shape)
print(output)

Input shape:  torch.Size([1, 1, 128, 256])
Output shape:  torch.Size([1, 6])
tensor([[-0.3893,  0.0243,  0.1863,  0.2853, -0.0108,  0.2773]],
       grad_fn=<AddmmBackward0>)
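
The feature-map shape that reaches the GRU can be derived by hand from the layer definitions. The sketch below applies the standard output-size formula for convolution and pooling to the four conv/pool blocks, including the one-column width padding applied before the residual addition in the second block. The helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

def cnn_output_shape(n_mels, n_frames, channels=256, num_blocks=4):
    """Shape after num_blocks of (3x3 conv, padding 1) + (2x2 max pool)."""
    h, w = n_mels, n_frames
    for block in range(num_blocks):
        h, w = conv_out(h, 3, padding=1), conv_out(w, 3, padding=1)  # conv keeps size
        if block == 1:
            w += 1  # F.pad(x, (0, 1)) before adding the residual branch
        h, w = conv_out(h, 2, stride=2), conv_out(w, 2, stride=2)    # pool halves it
    return channels, h, w

print(cnn_output_shape(128, 256))  # (256, 8, 16)
print(cnn_output_shape(128, 160))  # (256, 8, 10)
```

With 128 mel bins the height is always 8, so the per-step feature size is 256 * 8 = 2048, which matches the GRU's `input_size` and the `input_freq_bins=8` default of the transformer model. Only the sequence length changes with the number of time frames.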


In [70]:
print(torch.__version__)  # PyTorch version
print(torch.version.cuda)  # CUDA version
print(torch.backends.cudnn.version())  # cuDNN version

1.13.1+cu116
11.6
8302


# Training

In [43]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0


In [44]:
def train_model(model, optimizer, criterion, device, train_loader, num_epochs=10):
    model.to(device)
    model.train()

    for epoch in range(num_epochs):
        running_loss = 0.0

        for i, (inputs, labels) in enumerate(train_loader):
            inputs = inputs.unsqueeze(1)
            inputs, labels = inputs.to(device), labels.to(device)
            

            optimizer.zero_grad()

            outputs = model(inputs)

            loss = criterion(outputs, labels)

            loss.backward()
            optimizer.step()

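            # loss.item() is the mean loss of one batch; running_loss is the sum
            # over all batches in the epoch, so the printed value is not an average
            running_loss += loss.item()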

        print(f"Epoch {epoch + 1}/{num_epochs} Loss: {running_loss:.4f}")

    print("Finished Training")
    return model

def test_model(model, test_loader, device):
    model.to(device)
    model.eval()
    
    correct = 0
    total = 0
    
    # no need for gradients in testing
    with torch.no_grad():
        for data in test_loader:
            inputs, labels = data
            inputs = inputs.unsqueeze(1)
            
            inputs = inputs.to(device)
            labels = labels.to(device)
            
            # calculate outputs by running spectrograms through the network
            outputs = model(inputs)
            
            # the class with the highest value is prediction
            _, predicted = torch.max(outputs, 1)
            
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    acc = 100 * correct / total
    return acc

# Base CNN Training

In [54]:
model = Base_CNN().to(device)

criterion = nn.CrossEntropyLoss().to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [55]:
trained_model = train_model(model, optimizer, criterion, device, train_loader, num_epochs=10)

Epoch 1/10 Loss: 277.8612
Epoch 2/10 Loss: 242.1291
Epoch 3/10 Loss: 227.7572
Epoch 4/10 Loss: 206.3700
Epoch 5/10 Loss: 193.8552
Epoch 6/10 Loss: 172.9846
Epoch 7/10 Loss: 155.0231
Epoch 8/10 Loss: 135.0378
Epoch 9/10 Loss: 126.9739
Epoch 10/10 Loss: 103.0389
Finished Training
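
The printed loss is the sum of per-batch losses over an epoch, so it is easier to interpret after dividing by the number of batches. The sketch below assumes the dataset holds all 7,442 CREMA-D clips, which the notebook does not confirm. The helper name `mean_batch_loss` is hypothetical.

```python
import math

def mean_batch_loss(summed_loss, num_samples, batch_size=32):
    """Divide an epoch's summed loss by the number of batches."""
    num_batches = math.ceil(num_samples / batch_size)
    return summed_loss / num_batches

# Assumption: 7,442 clips in total, so int(7442 * 0.8) = 5953 training samples.
train_samples = int(7442 * 0.8)
first = mean_batch_loss(277.8612, train_samples)  # epoch 1 of the run above
last = mean_batch_loss(103.0389, train_samples)   # epoch 10 of the run above
chance = math.log(6)  # cross-entropy of a uniform guess over six classes
print(f"{first:.3f} {last:.3f} {chance:.3f}")
```

Under that assumption the mean loss starts below the uniform-guess level of ln(6) and falls to roughly a third of it by the final epoch.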


In [56]:
acc = test_model(trained_model, test_loader, device)

print(f"Accuracy of the model: {acc:.2f}")

Accuracy of the model: 60.71
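
A single accuracy figure does not show which emotions the model confuses. One way to inspect that is a confusion matrix with per-class recall, sketched below in plain Python. The label lists are hypothetical and are not results from the models in this notebook.

```python
EMOTIONS = ["ANG", "DIS", "FEA", "HAP", "NEU", "SAD"]  # index order of emotion_map

def confusion_matrix(true_labels, predicted_labels, num_classes=6):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    recalls = {}
    for i, row in enumerate(matrix):
        total = sum(row)
        recalls[EMOTIONS[i]] = row[i] / total if total else float("nan")
    return recalls

# Hypothetical labels for illustration only.
y_true = [0, 0, 1, 3, 3, 5]
y_pred = [0, 3, 1, 3, 0, 5]
recalls = per_class_recall(confusion_matrix(y_true, y_pred))
print(recalls)
```

In practice the two lists would be collected inside the evaluation loop by appending `labels` and `predicted` for each batch.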


# ResNet Training

In [22]:
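resnet = torchvision.models.resnet18(weights=None)  # no pretrained weights; trained from scratch

# Set to 6 output classes
resnet.fc = nn.Linear(in_features=512, out_features=6)
# Accept single-channel spectrograms instead of 3-channel RGB images
resnet.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)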

for param in resnet.parameters():
    param.requires_grad = True


criterion = nn.CrossEntropyLoss().to(device)

optimizer = torch.optim.Adam(resnet.parameters(), lr=0.001)




In [23]:
resnet = train_model(resnet, optimizer, criterion, device, train_loader, num_epochs=10)

Epoch 1/10 Loss: 267.1126
Epoch 2/10 Loss: 233.0354
Epoch 3/10 Loss: 216.8892
Epoch 4/10 Loss: 201.1892
Epoch 5/10 Loss: 187.5245
Epoch 6/10 Loss: 176.7374
Epoch 7/10 Loss: 162.0703
Epoch 8/10 Loss: 144.9984
Epoch 9/10 Loss: 123.7047
Epoch 10/10 Loss: 106.9874
Finished Training


In [24]:
acc = test_model(resnet, test_loader, device)

print(f"Accuracy of the model: {acc:.2f}")

Accuracy of the model: 56.35


# Base CNN Training (Alternative Hyperparameters)

In [25]:
model = Base_CNN().to(device)

criterion = nn.CrossEntropyLoss().to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=0.005)

In [26]:
model = train_model(model, optimizer, criterion, device, train_loader, num_epochs=20)

Epoch 1/20 Loss: 316.0492
Epoch 2/20 Loss: 263.4397
Epoch 3/20 Loss: 246.0173
Epoch 4/20 Loss: 225.8965
Epoch 5/20 Loss: 213.6423
Epoch 6/20 Loss: 198.3253
Epoch 7/20 Loss: 182.8854
Epoch 8/20 Loss: 177.3281
Epoch 9/20 Loss: 153.2265
Epoch 10/20 Loss: 150.8936
Epoch 11/20 Loss: 132.3498
Epoch 12/20 Loss: 119.7052
Epoch 13/20 Loss: 90.1852
Epoch 14/20 Loss: 92.4093
Epoch 15/20 Loss: 64.6251
Epoch 16/20 Loss: 56.2356
Epoch 17/20 Loss: 65.6004
Epoch 18/20 Loss: 37.1610
Epoch 19/20 Loss: 16.2058
Epoch 20/20 Loss: 34.5939
Finished Training


In [27]:
acc = test_model(model, test_loader, device)

print(f"Accuracy of the model: {acc:.2f}")

Accuracy of the model: 56.68


# CNN with Transformer Training

In [80]:
model = Base_CNN_Transformer().to(device)

criterion = nn.CrossEntropyLoss().to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

In [81]:
model = train_model(model, optimizer, criterion, device, train_loader, num_epochs=20)

Epoch 1/20 Loss: 293.5340
Epoch 2/20 Loss: 264.7619
Epoch 3/20 Loss: 250.5074
Epoch 4/20 Loss: 238.6807
Epoch 5/20 Loss: 222.9583
Epoch 6/20 Loss: 215.1919
Epoch 7/20 Loss: 202.2527
Epoch 8/20 Loss: 186.1087
Epoch 9/20 Loss: 178.2047
Epoch 10/20 Loss: 157.3877
Epoch 11/20 Loss: 144.3840
Epoch 12/20 Loss: 129.1506
Epoch 13/20 Loss: 108.6973
Epoch 14/20 Loss: 141.1226
Epoch 15/20 Loss: 79.3764
Epoch 16/20 Loss: 67.9324
Epoch 17/20 Loss: 57.0533
Epoch 18/20 Loss: 49.2521
Epoch 19/20 Loss: 39.5618
Epoch 20/20 Loss: 32.9790
Finished Training


In [82]:
acc = test_model(model, test_loader, device)

print(f"Accuracy of the model: {acc:.2f}")

Accuracy of the model: 63.13


# CNN with GRU Training

In [13]:
model = Base_CNN_GRU().to(device)

criterion = nn.CrossEntropyLoss().to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=0.00025)

In [14]:
model = train_model(model, optimizer, criterion, device, train_loader, num_epochs=15)

Epoch 1/15 Loss: 277.9192
Epoch 2/15 Loss: 235.7404
Epoch 3/15 Loss: 211.3875
Epoch 4/15 Loss: 190.6624
Epoch 5/15 Loss: 168.0269
Epoch 6/15 Loss: 143.8124
Epoch 7/15 Loss: 116.5335
Epoch 8/15 Loss: 104.6810
Epoch 9/15 Loss: 56.9923
Epoch 10/15 Loss: 29.1031
Epoch 11/15 Loss: 21.3106
Epoch 12/15 Loss: 7.2766
Epoch 13/15 Loss: 11.1959
Epoch 14/15 Loss: 17.6103
Epoch 15/15 Loss: 45.1654
Finished Training


In [15]:
acc = test_model(model, test_loader, device)

print(f"Accuracy of the model: {acc:.2f}")

Accuracy of the model: 59.57
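
The test accuracies printed throughout this notebook can be collected for a side-by-side view. The sketch below only restates the numbers already shown. The chance level of 100/6 assumes balanced classes, which is an approximation. The runs also come from different execution sessions with an unseeded random split, so they may not share the same test set.

```python
# Test accuracies (%) exactly as printed in the cells above.
results = {
    "Base CNN (lr=0.001, 10 epochs)": 60.71,
    "ResNet-18 (lr=0.001, 10 epochs)": 56.35,
    "Base CNN (lr=0.005, 20 epochs)": 56.68,
    "CNN + Transformer (lr=0.0001, 20 epochs)": 63.13,
    "CNN + GRU (lr=0.00025, 15 epochs)": 59.57,
}

CHANCE = 100 / 6  # uniform guessing over six classes, assuming balance

# Rank models from highest to lowest test accuracy.
ranked = sorted(results.items(), key=lambda kv: kv[1], reverse=True)
for name, acc in ranked:
    print(f"{name:<42} {acc:6.2f}  (+{acc - CHANCE:.2f} over chance)")
```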
