# **Speech to Emotion Recognition**
James Knee, Tyler Nguyen, Varsha Singh, Anish Sinha, Nathan Strahs


---

### Task ###
Our task is to use a deep learning architecture to identify the underlying emotion given some English speaking audio, formally known as Speech to Emotion Recognition (SER). Identifying emotions from speech is hard enough for people, and it requires careful analysis over time. Emotional conveyance is also subjective; different speakers articulate emotions differently, implying variations in pitch, intensity, rhythm, and cadence. This task is also challenging due to the complexity of raw audio signals, so the data will require significant amounts of preprocessing. In the end, we would like our model to differentiate between anger, disgust, fear, happiness, sadness, and neutrality.


---

# Architecture Overview

1. **Preprocessing**
      - Normalize audio volume
      - Convert audio to time-frequency representations like Spectograms
2. **Feature Extraction via ResNet**
      - Feed spectogram into Residual Network
      - Retain extracted features by removing final classification layer in ResNet
3. **Temporal Modeling via Transformer Encoder**
      - Pass ResNet output to transformer and capture long-range dependencies and sequential relationships in the audio
4. **Classification Layer**
      - Apply a softmax layer to classify the output into one of six emotion categories: anger, disgust, fear, happiness, sadness, neutrality.

Alternative Model: State Space Model (SSM) such as Mamba

[insert diagram here]

# Datasets

Below are the datasets we will use for our Speech Emotion Recognition project:

- **CREMA-D (Crowd-Sourced Emotional Multimodal Actors Dataset)**
  - **Description**: An audio-visual dataset comprising 7,442 clips from 91 actors (48 male, 43 female) aged between 20 and 74, representing diverse ethnic backgrounds. Actors vocalized 12 sentences expressing six emotions: anger, disgust, fear, happiness, neutral, and sadness. Each clip has multiple ratings for audio-only, visual-only, and audio-visual presentations.
  - **Link**: https://www.kaggle.com/datasets/ejlok1/cremad

- **RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song)**
  - **Description**: Comprises 7,356 files from 24 professional actors (12 male, 12 female) speaking two lexically-matched statements in a neutral North American accent. Speech includes eight emotions: neutral, calm, happy, sad, angry, fearful, surprise, and disgust, each at two intensity levels. Available in audio-only, video-only, and audio-visual formats.
  - **Link**: https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio

- **Berlin Emotional Database**
  - **Description**: Contains 535 utterances from ten actors (five male, five female) expressing seven emotions: anger, boredom, disgust, fear, happiness, sadness, and neutral. Recorded at 48kHz and downsampled to 16kHz.
  - **Link**: http://emodb.bilderbar.info/
  - **Kaggle Link**: https://www.kaggle.com/datasets/piyushagni5/berlin-database-of-emotional-speech-emodb


# Preprocessing Data

In [1]:
import sys
print(sys.executable)

/projectnb/ec523/projects/teamSER/miniconda/envs/mamba-env/bin/python


In [2]:
#imports
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchaudio
from torchvision import datasets
import torchvision.transforms as transforms
from torch.utils.data import Dataset, DataLoader
import os

import random
import torchaudio.transforms as T

In [3]:
! nvidia-smi

Wed Apr  9 19:02:39 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.06             Driver Version: 570.124.06     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla V100-SXM2-16GB           On  |   00000000:18:00.0 Off |                    0 |
| N/A   40C    P0             43W /  300W |       4MiB /  16384MiB |      0%   E. Process |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100-SXM2-16GB           On  |   00

In [2]:
import mamba_ssm
print(dir(mamba_ssm))

['Mamba', 'Mamba2', 'MambaLMHeadModel', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', 'distributed', 'mamba_inner_fn', 'models', 'modules', 'ops', 'selective_scan_fn', 'utils']


In [6]:
import torch
from mamba_ssm import Mamba

# Check if a GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("USING: " + device.type)

# Define a tiny Mamba model and move it to the GPU
model = Mamba(
    d_model=16,  # embedding dimension
    d_state=8,   # internal state dimension
    d_conv=4,    # convolution dimension
    expand=2,    # expansion factor
).to(device)

# Generate random input (batch_size, sequence_length, embedding_dim) and move it to the GPU
x = torch.randn(2, 10, 16).to(device)

# Forward pass
output = model(x)

print("Mamba output shape:", output.shape)


USING: cuda
Mamba output shape: torch.Size([2, 10, 16])


In [8]:
#necessary variables, assuming root directory is /projectnb/ec523/projects/teamSER folder
DATA_PATH="AudioWAV/"

training_split=0.8
testing_split=0.2
batch_size = 32

In [9]:
class AudioDataset(Dataset):
    def __init__(self, data_dir, transform=False, target_length=160):
        self.data_dir = data_dir
        self.transform = transform
        self.target_length = target_length

        # enumeration of emotions
        self.emotion_map = {
            "ANG": 0, "DIS": 1, "FEA": 2,
            "HAP": 3, "NEU": 4, "SAD": 5
        }

        # Filter only valid files with known emotion labels
        self.audio_files = [
            f for f in os.listdir(data_dir)
            if f.endswith('.wav') and f.split('_')[2] in self.emotion_map
        ]

        # Extract labels
        self.strlabels = [f.split('_')[2] for f in self.audio_files]
        self.labels = [self.emotion_map[label] for label in self.strlabels]

        # Fixed transforms
        self.sample_rate = 16000
        self.mel_transform = T.MelSpectrogram(
            sample_rate=self.sample_rate,
            n_fft=2048,
            hop_length=512,
            n_mels=128
        )
        self.db_transform = T.AmplitudeToDB()

        # Resampler reused for efficiency
        self.resampler = T.Resample(orig_freq=48000, new_freq=self.sample_rate)  # Assume worst-case

    def __len__(self):
        return len(self.audio_files)

    def __getitem__(self, idx):
        file_path = os.path.join(self.data_dir, self.audio_files[idx])
        waveform, sample_rate = torchaudio.load(file_path)

        # Resample to 16kHz if needed
        if sample_rate != self.sample_rate:
            resample = T.Resample(orig_freq=sample_rate, new_freq=self.sample_rate)
            waveform = resample(waveform)

        # Convert stereo to mono
        if waveform.shape[0] > 1:
            waveform = waveform.mean(dim=0, keepdim=True)

        # Normalize waveform
        waveform = waveform - waveform.mean()

        # Volume augmentation on waveform
        if self.transform and random.random() < 0.5:
            waveform = T.Vol(gain=(0.5, 1.5), gain_type="amplitude")(waveform)

        # Compute Mel spectrogram and convert to dB
        mel_spec = self.mel_transform(waveform)
        mel_spec = self.db_transform(mel_spec)

        # MinMax normalization to [0, 1]
        mel_min = mel_spec.min()
        mel_max = mel_spec.max()
        mel_spec = (mel_spec - mel_min) / (mel_max - mel_min + 1e-6)

        # Spectrogram-level augmentation
        if self.transform:
            if random.random() < 0.5:
                mel_spec = T.FrequencyMasking(freq_mask_param=15)(mel_spec)
            if random.random() < 0.5:
                mel_spec = T.TimeMasking(time_mask_param=35)(mel_spec)

        # Fix time dimension by padding or cropping
        current_length = mel_spec.shape[-1]
        if current_length < self.target_length:
            pad_amount = self.target_length - current_length
            mel_spec = F.pad(mel_spec, (0, pad_amount))
        else:
            mel_spec = mel_spec[:, :, :self.target_length]

        label = torch.tensor(self.labels[idx], dtype=torch.long)

        # Remove channel dimension if needed (1, 128, T) -> (128, T)
        mel_spec = mel_spec.squeeze(0)

        return mel_spec, label


In [11]:
#this function pads per batch so that every spectogram is the same dimension per batch

def collate_fn(batch):
    spectrograms, labels = zip(*batch)
    
    max_length = max(spec.shape[1] for spec in spectrograms)

    #pad spectrograms to match longest
    spectrograms_padded = [torch.nn.functional.pad(spec, (0, max_length - spec.shape[1])) for spec in spectrograms]

    # Convert list to tensor
    spectrograms_padded = torch.stack(spectrograms_padded)

    labels = torch.tensor(labels, dtype=torch.long)
    return spectrograms_padded, labels

In [13]:
#declaring dataset
dataset = AudioDataset(DATA_PATH)

#calculate training size and testing size
train_size = int(dataset.__len__()*training_split)
test_size = dataset.__len__()-train_size

train_set, test_set = torch.utils.data.random_split(dataset, [train_size, test_size])

train_set.dataset.transform = True
test_set.dataset.transform = False

#dataloaders
train_loader = DataLoader(train_set, batch_size=batch_size, collate_fn=collate_fn, shuffle=True)
test_loader = DataLoader(test_set, batch_size=batch_size, collate_fn=collate_fn, shuffle=False)

#FINAL DIMENSIONS OF SPECS: BatchSize x 128 x MaxTimeLength

# Mamba

In [12]:
import torch
import torch.nn as nn
from mamba_ssm import Mamba
import torchaudio
import torchaudio.transforms as T
from torch.utils.data import DataLoader, Dataset
import os
import random

class PureAudioMamba(nn.Module):  
    def __init__(self, num_classes=6, d_model=256):
        super().__init__()
        self.input_proj = nn.Linear(128, d_model)
        
        self.mamba1 = Mamba(
            d_model=d_model,
            d_state=16,
            d_conv=4,
            expand=2
        )
        self.mamba2 = Mamba(
            d_model=d_model,
            d_state=16,
            d_conv=4,
            expand=2
        )
        
        self.attention_pool = nn.Sequential(
            nn.Linear(d_model, 1),
            nn.Softmax(dim=1)
        )
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, x):
        x = x.permute(0, 2, 1)  
        x = self.input_proj(x)  
        
        x = self.mamba1(x)
        x = self.mamba2(x)
        
        attn_weights = self.attention_pool(x)
        x = torch.sum(x * attn_weights, dim=1)
        return self.classifier(x)

def train_model(train_loader, test_loader):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = PureAudioMamba(num_classes=6).to(device)  # Fixed class name
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, 
                                              max_lr=1e-3,
                                              total_steps=len(train_loader)*100)

    best_acc = 0
    for epoch in range(20):
        model.train()
        train_loss = 0
        for specs, labels in train_loader:
            specs, labels = specs.to(device), labels.to(device)
            
            outputs = model(specs)
            loss = criterion(outputs, labels)
            
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            
            train_loss += loss.item()

        # Validation
        model.eval()
        val_loss = 0
        correct = 0
        total = 0
        with torch.no_grad():
            for specs, labels in test_loader:
                specs, labels = specs.to(device), labels.to(device)
                outputs = model(specs)
                loss = criterion(outputs, labels)
                val_loss += loss.item()
                
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

        # Metrics
        train_loss /= len(train_loader)
        val_loss /= len(test_loader)
        val_acc = 100 * correct / total
        
        scheduler.step(val_loss)
        
        print(f"Epoch {epoch+1}: "
              f"Train Loss: {train_loss:.4f} | "
              f"Val Loss: {val_loss:.4f} | "
              f"Val Acc: {val_acc:.2f}%")



train_model(train_loader, test_loader)

# NOTE: mamba uses roughly 3 * expand * d_modeel^2 parameters
# NOTE: AdamW is stochastic optimization that modifes the
# Adam optimizer by decoupling weight decay from gradient update
# NOTE: could also try switching scheduler for cosine with warmup; or lower max_lr

Epoch 1: Train Loss: 1.7834 | Val Loss: 1.7523 | Val Acc: 17.86%
Epoch 2: Train Loss: 1.7337 | Val Loss: 1.7076 | Val Acc: 29.35%
Epoch 3: Train Loss: 1.6784 | Val Loss: 1.6335 | Val Acc: 29.21%
Epoch 4: Train Loss: 1.5606 | Val Loss: 1.5326 | Val Acc: 36.74%
Epoch 5: Train Loss: 1.5035 | Val Loss: 1.4644 | Val Acc: 38.68%
Epoch 6: Train Loss: 1.4868 | Val Loss: 1.4381 | Val Acc: 38.48%
Epoch 7: Train Loss: 1.4705 | Val Loss: 1.4229 | Val Acc: 41.10%
Epoch 8: Train Loss: 1.4516 | Val Loss: 1.4133 | Val Acc: 41.71%
Epoch 9: Train Loss: 1.4399 | Val Loss: 1.4246 | Val Acc: 43.92%
Epoch 10: Train Loss: 1.4149 | Val Loss: 1.4282 | Val Acc: 43.32%
Epoch 11: Train Loss: 1.3965 | Val Loss: 1.3816 | Val Acc: 45.40%
Epoch 12: Train Loss: 1.3914 | Val Loss: 1.4011 | Val Acc: 46.34%
Epoch 13: Train Loss: 1.3815 | Val Loss: 1.4063 | Val Acc: 44.66%
Epoch 14: Train Loss: 1.3729 | Val Loss: 1.3805 | Val Acc: 47.01%
Epoch 15: Train Loss: 1.3577 | Val Loss: 1.3751 | Val Acc: 46.88%
Epoch 16: Train Los

In [14]:
## Test how many Epochs is best

In [13]:
class PureAudioMamba(nn.Module):  
    def __init__(self, num_classes=6, d_model=256):
        super().__init__()
        self.input_proj = nn.Linear(128, d_model)
        self.mamba1 = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)
        self.mamba2 = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)
        self.attention_pool = nn.Sequential(
            nn.Linear(d_model, 1),
            nn.Softmax(dim=1)
        )
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, x):
        x = x.permute(0, 2, 1)
        x = self.input_proj(x)
        x = self.mamba1(x)
        x = self.mamba2(x)
        attn_weights = self.attention_pool(x)
        x = torch.sum(x * attn_weights, dim=1)
        return self.classifier(x)

def train_model(train_loader, test_loader):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = PureAudioMamba(num_classes=6).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    
    best_val_acc = 0
    best_epoch = 0
    patience = 5  # Stop after 5 epochs without improvement
    early_stop = False
    
    for epoch in range(100):  # Max epochs set high for early stopping
        if early_stop:
            print(f"Early stopping at epoch {epoch+1}")
            break
            
        # Training phase
        model.train()
        train_loss = 0.0
        train_correct = 0
        train_total = 0
        
        for specs, labels in train_loader:
            specs, labels = specs.to(device), labels.to(device)
            
            optimizer.zero_grad()
            outputs = model(specs)
            loss = criterion(outputs, labels)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            
            # Track training metrics
            train_loss += loss.item()
            _, predicted = torch.max(outputs.data, 1)
            train_total += labels.size(0)
            train_correct += (predicted == labels).sum().item()
        
        # Calculate training metrics
        train_loss /= len(train_loader)
        train_acc = 100 * train_correct / train_total
        
        # Validation phase
        model.eval()
        val_loss = 0.0
        val_correct = 0
        val_total = 0
        
        with torch.no_grad():
            for specs, labels in test_loader:
                specs, labels = specs.to(device), labels.to(device)
                outputs = model(specs)
                loss = criterion(outputs, labels)
                
                val_loss += loss.item()
                _, predicted = torch.max(outputs.data, 1)
                val_total += labels.size(0)
                val_correct += (predicted == labels).sum().item()
        
        # Calculate validation metrics
        val_loss /= len(test_loader)
        val_acc = 100 * val_correct / val_total
        
        # Print metrics
        print(f"Epoch {epoch+1:03d}: "
              f"Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.2f}% | "
              f"Val Loss: {val_loss:.4f} | Val Acc: {val_acc:.2f}%")
        
        # Early stopping check
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            best_epoch = epoch + 1
        elif (epoch + 1 - best_epoch) >= patience:
            print(f"No improvement for {patience} epochs. Early stopping...")
            early_stop = True
            
    print(f"\nTraining complete. Best validation accuracy: {best_val_acc:.2f}% at epoch {best_epoch}")

# Start training
train_model(train_loader, test_loader)


Epoch 001: Train Loss: 1.7303 | Train Acc: 22.12% | Val Loss: 1.6675 | Val Acc: 26.33%
Epoch 002: Train Loss: 1.5496 | Train Acc: 33.93% | Val Loss: 1.4819 | Val Acc: 34.45%
Epoch 003: Train Loss: 1.5505 | Train Acc: 35.71% | Val Loss: 1.4555 | Val Acc: 37.61%
Epoch 004: Train Loss: 1.4787 | Train Acc: 37.44% | Val Loss: 1.4295 | Val Acc: 39.76%
Epoch 005: Train Loss: 1.4563 | Train Acc: 39.58% | Val Loss: 1.4077 | Val Acc: 41.97%
Epoch 006: Train Loss: 1.4305 | Train Acc: 40.92% | Val Loss: 1.3753 | Val Acc: 43.72%
Epoch 007: Train Loss: 1.4165 | Train Acc: 41.95% | Val Loss: 1.3702 | Val Acc: 42.78%
Epoch 008: Train Loss: 1.3881 | Train Acc: 43.66% | Val Loss: 1.3334 | Val Acc: 46.54%
Epoch 009: Train Loss: 1.3653 | Train Acc: 44.73% | Val Loss: 1.3124 | Val Acc: 47.08%
Epoch 010: Train Loss: 1.3493 | Train Acc: 45.39% | Val Loss: 1.3073 | Val Acc: 46.68%
Epoch 011: Train Loss: 1.3311 | Train Acc: 46.14% | Val Loss: 1.3354 | Val Acc: 45.94%
Epoch 012: Train Loss: 1.3279 | Train Acc: 

# Declaring Models

In [10]:
#all models should accept inputs of differnet lengths (shouldn't have to worry about mamba)
#we should look into using global adaptive pooling

'''
TODO: 

MODELS THAT WE NEED TO MAKE:
CNN-Transformer: Should we use a resnet on this? Would that be overkill? We could use a resnet
    and train it ourselves (not sure if a pretrained resnet would be great)
    
Regular CNN: this will be our base model for comparison. We should play around with this, and
    this should be the same kind of CNN that we use in our other models (i.e. resnet?)
    
Mamba Model: we should train a basic mamba model

Mamba-CNN: we should incorporate a cnn with a mamba model

Pretrained SOTA model: we should delcare a pretrained state of the art model and compare against that
'''
class Sequential_CNN_Transformer(nn.Module):
    def __init__(self):
        super(Sequential_CNN_Transformer, self).__init__()
        
        #declare layers
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)

        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(64)

        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(128)

        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

        self.relu = nn.ReLU()
        
        self.global_pool = nn.AdaptiveAvgPool1d(1)
        self.classifier = nn.Linear(128, 6)
        
    def forward(self, x):
        #x = x.unsqueeze(1) #added unsqueeze to the training function instead

        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.pool(x)

        x = self.conv2(x)
        x = self.bn2(x)
        x = self.relu(x)
        x = self.pool(x)

        x = self.conv3(x)
        x = self.bn3(x)
        x = self.relu(x)
        x = self.pool(x)
        
        # flatten for the transformer (batch, sequence, features)
        x = x.flatten(2)
        x = x.permute(0, 2, 1)
        
        # added this to fix dimensionality issues in train_model method
        x = x.permute(0, 2, 1)
        x = self.global_pool(x)
        x = x.squeeze(2)
        x = self.classifier(x)

        return x
        
class Base_CNN(nn.Module):
    def __init__(self):
        super(Base_CNN, self).__init__()
        
        #declare layers
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(64)

        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(128)

        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

        self.relu = nn.ReLU()
        
        self.conv4 = nn.Conv2d(128, 256, kernel_size=3, padding=1)
        self.bn4 = nn.BatchNorm2d(256)
        
        self.global_pool = nn.AdaptiveAvgPool2d((1,1))
        self.max_pool = nn.AdaptiveMaxPool2d((1,1))
        
        self.classifier1 = nn.Linear(256, 64)
        self.classifier2 = nn.Linear(64, 6)
        
        self.residualConv = nn.Conv2d(1, 64, kernel_size=1, stride=2, padding=1)
        
    def forward(self, x):
        residual = x

        x = self.conv1(x)
        x = self.relu(x)
        x = self.pool(x)
        x = self.bn1(x)

        x = self.conv2(x)
        
        
        residual = self.residualConv(residual)
        
        x = F.pad(x, (0, 1))
        
        residual = residual[:, :, :x.shape[2], :x.shape[3]]
        
        # print(f"x shape: {x.shape}, res shape: {residual.shape}")
        
        x = x + residual
        
        x = self.relu(x)
        x = self.pool(x)
        x = self.bn2(x)

        x = self.conv3(x)
        x = self.relu(x)
        x = self.pool(x)
        x = self.bn3(x)
        
        x = self.conv4(x)
        x = self.relu(x)
        x = self.pool(x)
        x = self.bn4(x)
        
        #to fix dimensionality
        x = self.max_pool(x)
        x = torch.flatten(x, 1)
        
        x = self.classifier1(x)
        x = self.classifier2(x)

        return x
    
class Base_CNN_Transformer(nn.Module):
    def __init__(self, transformer_layers=2, n_heads=4, transformer_dim=256):
        super(Base_CNN_Transformer, self).__init__()
        
        # CNN
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)

        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(64)

        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(128)

        self.conv4 = nn.Conv2d(128, transformer_dim, kernel_size=3, padding=1)
        self.bn4 = nn.BatchNorm2d(transformer_dim)

        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.relu = nn.ReLU()

        # Transformer Encoder
        encoder_layer = nn.TransformerEncoderLayer(d_model=transformer_dim, nhead=n_heads, dropout=0.2)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=transformer_layers)

        # Classification
        self.global_pool = nn.AdaptiveAvgPool1d(1)
        self.classifier = nn.Sequential(
            nn.Linear(transformer_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 6)
        )
        
    def apply_layernorm(self, x):
        B, C, H, W = x.shape
        return nn.LayerNorm([C, H, W]).to(x.device)(x)

    def forward(self, x):
        # CNN
        x = self.relu(self.apply_layernorm(self.pool(self.conv1(x))))
        x = self.relu(self.apply_layernorm(self.pool(self.conv2(x))))
        x = self.relu(self.apply_layernorm(self.pool(self.conv3(x))))
        x = self.relu(self.apply_layernorm(self.pool(self.conv4(x))))

        # Prepare data
        batch_size, channels, freq, time = x.shape
        x = x.view(batch_size, channels, freq * time)
        x = x.permute(2, 0, 1)  # (seq_length, batch_size, transformer_dim)

        # Transformer
        x = self.transformer_encoder(x)

        x = x.permute(1, 2, 0)  # (batch_size, transformer_dim, seq_length)
        x = self.global_pool(x).squeeze(2)

        # Classification
        x = self.classifier(x)

        return x

# Test Network

In [11]:
dummy_model = Base_CNN_Transformer()

dummy_input = torch.randn(1, 128, 256).unsqueeze(1)

output = dummy_model(dummy_input)

print("Input shape: ", dummy_input.shape)
print("Output shape: ", output.shape)
print(output)

Input shape:  torch.Size([1, 1, 128, 256])
Output shape:  torch.Size([1, 6])
tensor([[-0.0770, -0.0534, -0.1549,  0.0021,  0.2035, -0.2948]],
       grad_fn=<AddmmBackward0>)


In [12]:
print(torch.__version__)  # PyTorch version
print(torch.version.cuda)  # CUDA version
print(torch.backends.cudnn.version())  # cuDNN version

1.13.1+cu116
11.6
8302


# Training

In [13]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

model = Sequential_CNN_Transformer().to(device)

criterion = nn.CrossEntropyLoss().to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

cuda:0


In [14]:
def train_model(model, optimizer, criterion, device, train_loader, num_epochs=10):
    model.to(device)
    model.train()

    for epoch in range(num_epochs):
        running_loss = 0.0

        for i, (inputs, labels) in enumerate(train_loader):
            inputs = inputs.unsqueeze(1)
            inputs, labels = inputs.to(device), labels.to(device)
            

            optimizer.zero_grad()

            outputs = model(inputs)

            loss = criterion(outputs, labels)

            loss.backward()
            optimizer.step()

            running_loss += loss.item()

        print(f"Epoch {epoch + 1}/{num_epochs} Loss: {running_loss:.4f}")

    print("Finished Training")
    return model

def test_model(model, test_loader, device):
    model.to(device)
    model.eval()
    
    correct = 0
    total = 0
    
    # no need for gradients in testing
    with torch.no_grad():
        for data in test_loader:
            inputs, labels = data
            inputs = inputs.unsqueeze(1)
            
            inputs = inputs.to(device)
            labels = labels.to(device)
            
            # calculate outputs by running images through the network
            outputs = model(inputs)
            
            # the class with the highest value is prediction
            _, predicted = torch.max(outputs, 1)
            
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    acc = 100 * correct / total
    return acc

In [31]:
trained_model = train_model(model, optimizer, criterion, device, train_loader)

Epoch 1/10 Loss: 308.6506
Epoch 2/10 Loss: 286.0523
Epoch 3/10 Loss: 274.7413
Epoch 4/10 Loss: 269.3430
Epoch 5/10 Loss: 261.6475
Epoch 6/10 Loss: 255.7968
Epoch 7/10 Loss: 252.8377
Epoch 8/10 Loss: 249.7726
Epoch 9/10 Loss: 247.1257
Epoch 10/10 Loss: 243.8430
Finished Training


In [32]:
acc = test_model(trained_model, test_loader, device)

print(f"Accuracy of the model: {acc:.2f}")

Accuracy of the model: 40.16


In [65]:
resnet = torchvision.models.resnet18(pretrained=False)

# Set to 6 output classes
resnet.fc = nn.Linear(in_features=512, out_features=6)
resnet.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

for param in resnet.parameters():
    param.requires_grad = True


criterion = nn.CrossEntropyLoss().to(device)

optimizer = torch.optim.Adam(resnet.parameters(), lr=0.001)


In [66]:
resnet = train_model(resnet, optimizer, criterion, device, train_loader, num_epochs=10)

Epoch 1/10 Loss: 286.7953
Epoch 2/10 Loss: 251.3834
Epoch 3/10 Loss: 236.2239
Epoch 4/10 Loss: 221.6431
Epoch 5/10 Loss: 209.7092
Epoch 6/10 Loss: 194.8534
Epoch 7/10 Loss: 182.7004
Epoch 8/10 Loss: 160.8685
Epoch 9/10 Loss: 140.0472
Epoch 10/10 Loss: 115.6093
Finished Training


In [63]:
acc = test_model(resnet, test_loader, device)

print(f"Accuracy of the model: {acc:.2f}")

Accuracy of the model: 44.93


In [11]:
model = Base_CNN().to(device)

criterion = nn.CrossEntropyLoss().to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=0.005)

In [12]:
model = train_model(model, optimizer, criterion, device, train_loader, num_epochs=20)

Epoch 1/20 Loss: 370.1472
Epoch 2/20 Loss: 259.5503
Epoch 3/20 Loss: 245.5657
Epoch 4/20 Loss: 231.0189
Epoch 5/20 Loss: 216.5164
Epoch 6/20 Loss: 200.5484
Epoch 7/20 Loss: 188.2512
Epoch 8/20 Loss: 164.3314
Epoch 9/20 Loss: 147.4764
Epoch 10/20 Loss: 120.3046
Epoch 11/20 Loss: 105.0427
Epoch 12/20 Loss: 85.1477
Epoch 13/20 Loss: 110.7600
Epoch 14/20 Loss: 45.0547
Epoch 15/20 Loss: 77.1016
Epoch 16/20 Loss: 24.6913
Epoch 17/20 Loss: 41.2945
Epoch 18/20 Loss: 18.8090
Epoch 19/20 Loss: 25.5523
Epoch 20/20 Loss: 50.9678
Finished Training


In [14]:
acc = test_model(model, test_loader, device)

print(f"Accuracy of the model: {acc:.2f}")

Accuracy of the model: 54.67


# Training CNN with Transformer

In [15]:
model = Base_CNN_Transformer().to(device)

criterion = nn.CrossEntropyLoss().to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)

In [16]:
model = train_model(model, optimizer, criterion, device, train_loader, num_epochs=40)

Epoch 1/40 Loss: 320.6334
Epoch 2/40 Loss: 300.9683
Epoch 3/40 Loss: 290.2176
Epoch 4/40 Loss: 285.7240
Epoch 5/40 Loss: 288.5643
Epoch 6/40 Loss: 282.3227
Epoch 7/40 Loss: 280.6535
Epoch 8/40 Loss: 279.6519
Epoch 9/40 Loss: 282.0516
Epoch 10/40 Loss: 277.3430
Epoch 11/40 Loss: 278.3120
Epoch 12/40 Loss: 275.5279
Epoch 13/40 Loss: 273.4604
Epoch 14/40 Loss: 276.0052
Epoch 15/40 Loss: 272.2635
Epoch 16/40 Loss: 269.8875
Epoch 17/40 Loss: 279.0987
Epoch 18/40 Loss: 266.9859
Epoch 19/40 Loss: 270.6569
Epoch 20/40 Loss: 268.8173
Epoch 21/40 Loss: 265.4277
Epoch 22/40 Loss: 262.2337
Epoch 23/40 Loss: 261.1713
Epoch 24/40 Loss: 256.5142
Epoch 25/40 Loss: 261.9590
Epoch 26/40 Loss: 260.8328
Epoch 27/40 Loss: 251.1511
Epoch 28/40 Loss: 246.6781
Epoch 29/40 Loss: 247.1224
Epoch 30/40 Loss: 246.4069
Epoch 31/40 Loss: 241.2922
Epoch 32/40 Loss: 234.1422
Epoch 33/40 Loss: 231.8619
Epoch 34/40 Loss: 231.2366
Epoch 35/40 Loss: 230.4072
Epoch 36/40 Loss: 220.6230
Epoch 37/40 Loss: 217.3497
Epoch 38/4

In [17]:
acc = test_model(model, test_loader, device)

print(f"Accuracy of the model: {acc:.2f}")

Accuracy of the model: 42.38
