## Loading the data

In this step, I load only the modelling data: the cleaned dataset with the demo set excluded, which leaves 7431 audio recordings in total.

In [25]:
import pandas as pd

df = pd.read_csv("modelling_metadata.csv")
df.head()
df.shape

(7431, 10)

## Imports and configuration

In [7]:
import os
import numpy as np

# Audio settings
SAMPLE_RATE = 16000
DURATION_SEC = 3.0
TARGET_SAMPLES = int(SAMPLE_RATE * DURATION_SEC)

# Spectrogram settings
N_FFT = 1024
HOP_LENGTH = 256
N_MELS = 128

# Expected time frames
TARGET_FRAMES = 1 + (TARGET_SAMPLES // HOP_LENGTH)

# Output folders
OUT_NPY_DIR = "mel_npy"
OUT_PNG_DIR = "mel_png"
os.makedirs(OUT_NPY_DIR, exist_ok=True)
os.makedirs(OUT_PNG_DIR, exist_ok=True)

In this step, I define the audio and spectrogram configuration used to convert speech into fixed-size log-mel spectrograms for a CNN + BiLSTM model. I resample all audio to 16 kHz and trim or pad it to 3 seconds (48,000 samples) so that every input has the same length. I compute log-mel spectrograms with 128 mel bands, a 1024-sample FFT window and a hop length of 256 samples, which gives every spectrogram the same 188 time frames. I also create output directories for saving spectrograms as .npy files for training and .png files for visualization, which keeps the data preparation pipeline clean and reproducible.
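As a quick sanity check on these settings, the expected tensor size can be worked out by hand. The sketch below repeats the constants from the cell above and assumes librosa's default centred framing (`center=True`), under which the frame count is `1 + n_samples // hop_length`.

```python
# Constants repeated from the configuration cell above.
SAMPLE_RATE = 16000
DURATION_SEC = 3.0
HOP_LENGTH = 256
N_MELS = 128

# 3 seconds at 16 kHz.
target_samples = int(SAMPLE_RATE * DURATION_SEC)

# Assumption: centred framing (librosa's default), which pads the signal
# on both sides so that frames are centred on multiples of the hop length.
target_frames = 1 + target_samples // HOP_LENGTH

print(target_samples, target_frames)     # 48000 188
print((1, N_MELS, target_frames))        # (1, 128, 188)
```

This matches the spectrogram shape that the notebook later reads back from the saved files.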

## Loading and normalizing audio length

In [8]:
import numpy as np
import librosa

def load_and_fix_length(path, sample_rate=SAMPLE_RATE, target_samples=TARGET_SAMPLES):
    y, sr = librosa.load(path, sr=None)

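    # librosa.load already downmixes to mono by default, so this is only a
    # safeguard in case multi-channel audio (channels, samples) gets through
    if y.ndim > 1:
        y = np.mean(y, axis=0)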

    # resampling only if needed
    if sr != sample_rate:
        y = librosa.resample(y, orig_sr=sr, target_sr=sample_rate)

    # padding or trimming to fixed length
    if len(y) < target_samples:
        y = np.pad(y, (0, target_samples - len(y)), mode="constant")
    else:
        y = y[:target_samples]

    return y


I use this function to load each audio file, convert it to mono if necessary, resample it to a fixed sample rate, and pad or trim it to a fixed duration. By enforcing a consistent waveform length, I ensure that all generated log-mel spectrograms have the same shape and are compatible with the CNN + BiLSTM model.

## Why is fixed-length audio required?
Neural networks require inputs of consistent shape. By trimming or padding each audio clip to a fixed duration, I ensure that all resulting mel spectrograms have the same time dimension, which simplifies batching and allows the BiLSTM to process temporal information without variable-length handling.
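The pad-or-trim rule can be illustrated on tiny synthetic arrays. This is a minimal sketch: `fix_length` is a hypothetical helper that mirrors the padding branch of `load_and_fix_length`, and the target length of 8 samples is chosen only to keep the output readable.

```python
import numpy as np

def fix_length(y, target_samples):
    """Hypothetical helper: zero-pad at the end or truncate a 1-D signal."""
    if len(y) < target_samples:
        return np.pad(y, (0, target_samples - len(y)), mode="constant")
    return y[:target_samples]

short_clip = np.ones(5, dtype=np.float32)    # shorter than the target
long_clip = np.ones(12, dtype=np.float32)    # longer than the target

padded = fix_length(short_clip, 8)
trimmed = fix_length(long_clip, 8)

print(padded)                        # [1. 1. 1. 1. 1. 0. 0. 0.]
print(padded.shape, trimmed.shape)   # (8,) (8,)
```

Both clips end up with the same length, so their spectrograms can be stacked into one batch tensor.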

## Why is stereo audio converted to mono?
Stereo audio is converted to mono to ensure consistent input shape and because spatial information is not relevant for speech emotion recognition. Emotional cues are primarily encoded in pitch, energy and timing rather than channel differences.
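The downmix itself is just an average over the channel axis. The sketch below uses a small synthetic stereo array in the `(channels, samples)` layout; the values are made up for illustration.

```python
import numpy as np

# Synthetic stereo clip: row 0 is the left channel, row 1 the right channel.
stereo = np.array([[0.2, 0.4, -0.6],
                   [0.0, 0.2, -0.2]], dtype=np.float32)

# Average the two channels sample by sample.
mono = stereo.mean(axis=0)

print(stereo.shape, mono.shape)   # (2, 3) (3,)
```

The result has one value per time step, which is the 1-D waveform the rest of the pipeline expects.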

## Converting waveform into log-Mel spectrogram

In [9]:
def wav_to_logmel(y, sample_rate=SAMPLE_RATE, n_fft=N_FFT, hop_length=HOP_LENGTH, n_mels=N_MELS):
    S = librosa.feature.melspectrogram(
        y=y,
        sr=sample_rate,
        n_fft=n_fft,
        hop_length=hop_length,
        n_mels=n_mels,
        power=2.0
    )

    S_db = librosa.power_to_db(S, ref=np.max)

    # per-sample standardization
    S_norm = (S_db - S_db.mean()) / (S_db.std() + 1e-6)

    # output shape for PyTorch CNN: (1, n_mels, time)
    return S_norm[np.newaxis, :, :].astype(np.float32)


I use this function to convert a fixed-length audio signal into a log-mel spectrogram that is suitable for CNN + BiLSTM training. By applying log scaling and normalization and returning the spectrogram in a channel-first format, I ensure that all inputs are numerically stable, consistent in shape and directly compatible with PyTorch models.

## Why is log scaling applied?
Mel power values span several orders of magnitude, so a few high-energy bins would otherwise dominate the input. Converting power to decibels compresses this dynamic range, roughly matches how loudness is perceived, and makes quieter spectral detail visible to the network. The per-sample standardization that follows then brings every spectrogram to zero mean and unit variance, which keeps the inputs numerically stable.
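The log scaling and per-sample standardization applied in `wav_to_logmel` can be sketched in plain NumPy. This is an approximation under stated assumptions: the input is random data standing in for a mel power spectrogram, and the decibel formula omits the default `top_db` clipping that `librosa.power_to_db` also applies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a mel power spectrogram: positive values spread over
# eight orders of magnitude.
S = 10.0 ** rng.uniform(-8, 0, size=(128, 188))

# Decibels relative to the loudest bin; the floor guards against log(0).
S_db = 10.0 * np.log10(np.maximum(S, 1e-10) / S.max())

# Per-sample standardization, as in wav_to_logmel.
S_norm = (S_db - S_db.mean()) / (S_db.std() + 1e-6)

print("power range ratio:", S.max() / S.min())
print("dB range:", S_db.min(), "to", S_db.max())
print("after standardization:", S_norm.mean(), S_norm.std())
```

The loudest bin maps to 0 dB and everything else becomes negative, so the raw ratio of many orders of magnitude turns into a range of a few tens of decibels before standardization.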

## Saving a PNG preview image

In [10]:
import numpy as np
import matplotlib.pyplot as plt
import librosa.display

def save_logmel_png(mel_img, out_path):
    # mel_img shape: (1, n_mels, time)
    S_img = np.squeeze(mel_img, axis=0)

    plt.figure(figsize=(4, 4))
    librosa.display.specshow(S_img)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.savefig(out_path, dpi=150, bbox_inches="tight", pad_inches=0)
    plt.close()


I use this function to save a log-mel spectrogram as a clean PNG image for visualization and inspection. By removing the channel dimension and disabling all axes and padding, I ensure that the saved image contains only the spectrogram content, while keeping the NPY files as the primary input for CNN + BiLSTM training.

## Converting and saving all files

In [11]:
max_files = None
SAVE_PNG = True
LOG_EVERY = 500           # progress print frequency

rows = df if max_files is None else df.head(max_files)

for i, row in enumerate(rows.itertuples(index=False), start=1):
    wav_path = row.file_path
    file_name = os.path.splitext(os.path.basename(wav_path))[0]

    try:
        # loading audio and enforcing fixed length
        y = load_and_fix_length(wav_path)

        # converting to log-mel in the agreed format (1, n_mels, T)
        mel_img = wav_to_logmel(y)

        # saving NPY
        np.save(os.path.join(OUT_NPY_DIR, f"{file_name}.npy"), mel_img)

        # saving PNG
        if SAVE_PNG:
            save_logmel_png(mel_img, os.path.join(OUT_PNG_DIR, f"{file_name}.png"))

    except Exception as e:
        print(f"Skipping file due to error: {wav_path}\n  -> {e}")
        continue

    if i % LOG_EVERY == 0:
        print(f"Processed {i} files...")

print("Saved NPY spectrograms to:", OUT_NPY_DIR)
if SAVE_PNG:
    print("Saved PNG previews to:", OUT_PNG_DIR)


Processed 500 files...
Processed 1000 files...
Processed 1500 files...
Processed 2000 files...
Processed 2500 files...
Processed 3000 files...
Processed 3500 files...
Processed 4000 files...
Processed 4500 files...
Processed 5000 files...
Processed 5500 files...
Processed 6000 files...
Processed 6500 files...
Processed 7000 files...
Saved NPY spectrograms to: mel_npy
Saved PNG previews to: mel_png


I use this block to iterate through the dataset, convert each audio file into a fixed-length log-mel spectrogram, and save it as a .npy file for model training, with optional PNG output for visualization. Files that raise an error are reported and skipped rather than stopping the run. Because every file passes through the same two functions and re-running the cell overwrites any existing output, all saved spectrograms follow the same preprocessing rules and are fully consistent with the CNN + BiLSTM pipeline.

## Why are NPY files used instead of images?
Spectrograms are saved as NumPy arrays rather than images to preserve exact numerical values and the true temporal structure of the signal. This format is more suitable for sequence-based models such as BiLSTMs and avoids artifacts introduced by image rendering.
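The difference can be checked with a small round trip. This is a rough sketch: the array is random data in the same shape as a saved spectrogram, and 8-bit rescaling is used as a simplified stand-in for image storage (a real PNG written by matplotlib also applies a colormap and resamples to the figure size).

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)
spec = rng.standard_normal((1, 128, 188)).astype(np.float32)

# Round trip through a .npy file in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.npy")
    np.save(path, spec)
    restored = np.load(path)

# Simplified image-style storage: rescale to 0..255 and round to 8 bits.
lo, hi = spec.min(), spec.max()
as_uint8 = np.round((spec - lo) / (hi - lo) * 255).astype(np.uint8)
from_uint8 = as_uint8.astype(np.float32) / 255 * (hi - lo) + lo

print("npy round trip exact:", np.array_equal(restored, spec))
print("max 8-bit error:", np.abs(from_uint8 - spec).max())
```

The .npy round trip returns the identical float32 array, while the 8-bit version carries a small quantization error in every bin.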

## Why is CNN + BiLSTM used?
The CNN is responsible for learning local spectral patterns from the mel spectrogram, while the BiLSTM models how these patterns evolve over time. This combination allows the system to capture both short-term acoustic features and long-term emotional dynamics.
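To make the hand-off between the two parts concrete, the shapes can be traced with simple arithmetic. The sketch below assumes the layout of the model class defined later in this notebook: 3x3 convolutions with padding 1 (which keep the size), three pooling stages that halve the frequency axis, only the first of which also halves the time axis, and 128 channels after the last convolution.

```python
N_MELS, T_FRAMES = 128, 188           # input spectrogram size
CHANNELS = 128                        # assumed channels after the last conv
POOLS = [(2, 2), (2, 1), (2, 1)]      # assumed (freq, time) pooling per stage

freq, time = N_MELS, T_FRAMES
for pool_f, pool_t in POOLS:
    # Max pooling with stride equal to the kernel size floors the division.
    freq, time = freq // pool_f, time // pool_t

seq_len = time                 # time steps passed to the BiLSTM
features = CHANNELS * freq     # feature vector size per time step

print(seq_len, features)       # 94 2048
```

Under these assumptions the CNN turns each spectrogram into a sequence of 94 steps with 2048 features each, and the BiLSTM reads that sequence in both directions.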

## Imports

In [38]:
import os
import re
import random
from pathlib import Path

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader


I import standard Python and PyTorch libraries to handle file paths, numerical data, dataset splitting, and neural network training. NumPy is used to load and manipulate mel spectrograms, while PyTorch provides tensor operations, model definitions and data loading utilities required for training the CNN + BiLSTM model.

## Configurations

In [39]:
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", DEVICE)

# Emotion mapping
EMO_MAP = {
    "ANG": 0,
    "DIS": 1,
    "FEA": 2,
    "HAP": 3,
    "NEU": 4,
    "SAD": 5
}
NUM_CLASSES = len(EMO_MAP)

# Data location
NPY_DIR = "mel_npy"


Device: cpu


I use this block to ensure reproducibility, configure the computation device, and define the global settings required for model training. Fixing the random seeds makes experimental results repeatable, and the device check selects a GPU when one is available and falls back to the CPU otherwise (this run used the CPU). I also define the emotion label mapping and the folder containing the precomputed spectrograms, both of which must match the preprocessing stage, so that training and evaluating the CNN + BiLSTM model on the CREMA-D dataset use a consistent configuration.
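The effect of seeding is easy to demonstrate with the standard library alone. In this minimal sketch, `shuffled_ids` is a hypothetical helper; it matters here because the actor split later in the notebook relies on `random.shuffle`, so the same seed yields the same split.

```python
import random

def shuffled_ids(seed):
    """Hypothetical helper: shuffle ten IDs after seeding the generator."""
    random.seed(seed)
    ids = list(range(10))
    random.shuffle(ids)
    return ids

first = shuffled_ids(42)
second = shuffled_ids(42)

print(first == second)   # True: same seed, same order
```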

## Collecting all NPY files

In [40]:
npy_paths = sorted([str(p) for p in Path(NPY_DIR).glob("*.npy")])
print("Found NPY files:", len(npy_paths))


Found NPY files: 7431


## Inferring spectrogram shape

In [41]:
example = np.load(npy_paths[0])
if example.ndim == 3:
    _, N_MELS, T_FRAMES = example.shape
else:
    N_MELS, T_FRAMES = example.shape

print("Spectrogram shape:", (1, N_MELS, T_FRAMES))


Spectrogram shape: (1, 128, 188)


I load one example spectrogram to infer the number of mel bands and time frames, ensuring that the model architecture is configured to exactly match the precomputed features.

The spectrogram shape (1, 128, 188) represents a single-channel log-mel spectrogram with 128 frequency bands observed over 188 time steps, giving the model both spectral and temporal information needed for emotion recognition.

## Parsing actor ID and emotion from filename

In [42]:
def parse_actor_and_emotion(filename: str):

    # Actor ID = first number in filename
    actor_match = re.search(r"\d+", filename)
    if actor_match is None:
        raise ValueError(f"No actor ID in {filename}")
    actor_id = actor_match.group(0)

    # Emotion code
    emotion = None
    for emo in EMO_MAP:
        if f"_{emo}_" in filename or filename.startswith(f"{emo}_"):
            emotion = EMO_MAP[emo]
            break

    if emotion is None:
        raise ValueError(f"No emotion code in {filename}")

    return actor_id, emotion


## Actor-independent split

In [44]:
def actor_independent_split(paths, train_ratio=0.7, val_ratio=0.15):
    by_actor = {}

    for p in paths:
        fname = os.path.basename(p)
        actor_id, _ = parse_actor_and_emotion(fname)
        by_actor.setdefault(actor_id, []).append(p)

    actors = list(by_actor.keys())
    random.shuffle(actors)

    n = len(actors)
    n_train = int(n * train_ratio)
    n_val = int(n * val_ratio)

    train_actors = set(actors[:n_train])
    val_actors   = set(actors[n_train:n_train+n_val])
    test_actors  = set(actors[n_train+n_val:])

    def collect(actor_set):
        out = []
        for a in actor_set:
            out.extend(by_actor[a])
        return out

    return (
        collect(train_actors),
        collect(val_actors),
        collect(test_actors)
    )


In [45]:
train_paths, val_paths, test_paths = actor_independent_split(npy_paths)
print(len(train_paths), len(val_paths), len(test_paths))


5141 1066 1224


I use an actor-independent splitting strategy to divide the dataset into training, validation, and test sets. By grouping samples by speaker and ensuring that no actor appears in more than one split, I prevent speaker leakage and obtain a more realistic evaluation of how well the model generalizes to unseen speakers.
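The no-leakage property can be verified directly by comparing the actor sets of the three subsets. The sketch below is self-contained and uses synthetic file names; `split_by_actor` and `actor_set` are hypothetical helpers, and the `<actor>_...` naming pattern is an assumption made for the example.

```python
import random

def split_by_actor(paths, train_ratio=0.7, val_ratio=0.15, seed=42):
    """Hypothetical helper: split paths so each actor lands in one subset."""
    by_actor = {}
    for p in paths:
        actor_id = p.split("_")[0]      # assumes "<actor>_..." file names
        by_actor.setdefault(actor_id, []).append(p)

    actors = sorted(by_actor)
    random.Random(seed).shuffle(actors)
    n_train = int(len(actors) * train_ratio)
    n_val = int(len(actors) * val_ratio)
    groups = (
        actors[:n_train],
        actors[n_train:n_train + n_val],
        actors[n_train + n_val:],
    )
    return [[p for a in group for p in by_actor[a]] for group in groups]

def actor_set(paths):
    return {p.split("_")[0] for p in paths}

# Synthetic data: 20 actors with 3 clips each.
paths = [f"{1001 + a}_clip{c}.npy" for a in range(20) for c in range(3)]
train, val, test = split_by_actor(paths)

print(len(train), len(val), len(test))
print(actor_set(train).isdisjoint(actor_set(val)))    # True
print(actor_set(train).isdisjoint(actor_set(test)))   # True
print(actor_set(val).isdisjoint(actor_set(test)))     # True
```

The same three `isdisjoint` checks could be run on `train_paths`, `val_paths` and `test_paths` as a guard against accidental speaker leakage.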

## PyTorch Dataset

In [46]:
class MelNPYDataset(Dataset):
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        path = self.paths[idx]
        fname = os.path.basename(path)

        # loading spectrogram
        spec = np.load(path).astype(np.float32)
        if spec.ndim == 2:
            spec = spec[np.newaxis, :, :]

        # parsing label from filename
        _, label = parse_actor_and_emotion(fname)

        x = torch.from_numpy(spec)
        y = torch.tensor(label, dtype=torch.long)
        return x, y


I use this Dataset class to load precomputed mel spectrograms from .npy files and extract emotion labels directly from their filenames. Each sample is returned in channel-first format along with its corresponding class label, enabling efficient training without relying on external metadata.

## Dataloaders

In [None]:
BATCH_SIZE = 32

train_loader = DataLoader(
    MelNPYDataset(train_paths),
    batch_size=BATCH_SIZE,
    shuffle=True
)

val_loader = DataLoader(
    MelNPYDataset(val_paths),
    batch_size=BATCH_SIZE,
    shuffle=False
)

test_loader = DataLoader(
    MelNPYDataset(test_paths),
    batch_size=BATCH_SIZE,
    shuffle=False
)


This block of code controls how the data is given to the model while it is training and being evaluated. It does not train the model itself. Instead, it decides how many samples are used at a time, in what order they are read, and which data is used for training, validation, and testing.

## What does BATCH_SIZE = 32 mean?
Setting the batch size to 32 means that the model processes 32 audio samples at once.
Instead of learning from one spectrogram at a time, the model learns from small groups of samples.

After processing one batch, the model updates its internal weights. Then it moves on to the next batch, until it has seen the entire dataset.
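With the split sizes printed earlier, the number of batches per pass follows from a short calculation. The sketch assumes the DataLoader default of keeping the final, smaller batch (`drop_last=False`).

```python
import math

BATCH_SIZE = 32
split_sizes = {"train": 5141, "val": 1066, "test": 1224}  # sizes printed earlier

summary = {}
for name, n_samples in split_sizes.items():
    n_batches = math.ceil(n_samples / BATCH_SIZE)
    last_batch = n_samples - (n_batches - 1) * BATCH_SIZE
    summary[name] = (n_batches, last_batch)
    print(f"{name}: {n_batches} batches, last batch has {last_batch} samples")

# train: 161 batches, last batch has 21 samples
# val: 34 batches, last batch has 10 samples
# test: 39 batches, last batch has 8 samples
```

One pass over the training set therefore corresponds to 161 weight updates.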

## What is a DataLoader?
A DataLoader is a tool that reads data from the dataset, groups samples into batches, optionally shuffles the order of samples, and feeds the batches to the model during training.
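The core idea can be sketched in a few lines of plain Python. `iterate_batches` is a hypothetical, heavily simplified stand-in: the real PyTorch `DataLoader` additionally stacks samples into tensors and can load data in parallel worker processes.

```python
import random

def iterate_batches(dataset, batch_size, shuffle=False, seed=None):
    """Hypothetical, simplified stand-in for the batching a DataLoader does."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    # Walk through the (possibly shuffled) indices in chunks of batch_size.
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]

samples = list(range(10))

ordered = list(iterate_batches(samples, batch_size=4))
print(ordered)    # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

shuffled = list(iterate_batches(samples, batch_size=4, shuffle=True, seed=0))
print(shuffled)   # same samples, different order
```

Shuffling changes which samples share a batch in each pass but never drops or duplicates a sample, which is why it is enabled for training and disabled for validation and testing.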

## Model definition

In [47]:
class CNNBiLSTM(nn.Module):
    def __init__(self, lstm_hidden=128):
        super().__init__()

        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2,2)),

            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2,1)),

            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2,1))
        )

        with torch.no_grad():
            dummy = torch.zeros(1, 1, N_MELS, T_FRAMES)
            z = self.cnn(dummy)
            C, Fp, Tp = z.shape[1], z.shape[2], z.shape[3]
            lstm_in = C * Fp

        self.lstm = nn.LSTM(
            lstm_in,
            lstm_hidden,
            batch_first=True,
            bidirectional=True
        )

        self.fc = nn.Sequential(
            nn.Dropout(0.3),
            nn.Linear(2*lstm_hidden, 128),
            nn.ReLU(),
            nn.Linear(128, NUM_CLASSES)
        )

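    def forward(self, x):
        # x: (B, 1, n_mels, T)
        z = self.cnn(x)             # (B, C, F', T')
        z = z.permute(0, 3, 1, 2)   # (B, T', C, F')
        z = z.flatten(2)            # (B, T', C*F')
        out, _ = self.lstm(z)       # (B, T', 2*lstm_hidden)
        # classify from the last time step; note that the backward half of
        # this vector has only processed the final frame at that point
        return self.fc(out[:, -1])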
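
One design detail of the `forward` method may be worth examining: at the last time step, the forward direction of a bidirectional LSTM has read the whole sequence, while the backward direction has read only the final frame. The sketch below is not the author's model; it uses a tiny LSTM with made-up sizes to show a common alternative, which is to combine the forward state at the last step with the backward state at the first step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = 4   # made-up size for illustration
lstm = nn.LSTM(input_size=8, hidden_size=hidden,
               batch_first=True, bidirectional=True)

x = torch.randn(2, 5, 8)              # (batch, time, features)
out, (h_n, _) = lstm(x)               # out: (2, 5, 2 * hidden)

# The forward direction finishes at the last step,
# the backward direction finishes at the first step.
fwd_final = out[:, -1, :hidden]
bwd_final = out[:, 0, hidden:]
summary = torch.cat([fwd_final, bwd_final], dim=1)   # (2, 2 * hidden)

# h_n holds the same final states: index 0 is forward, index 1 is backward.
print(torch.allclose(fwd_final, h_n[0]))   # True
print(torch.allclose(bwd_final, h_n[1]))   # True
print(summary.shape)                       # torch.Size([2, 8])
```

The resulting vector has the same size as `out[:, -1]`, so it could be passed to the existing classifier head without other changes. Whether it improves accuracy on this dataset would need to be tested.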
