# CM3070 Computer Science Final Project
# 01 Data Preprocessing

---

## **Table of Contents**  
1. [Introduction](#introduction)  
2. [Objectives](#objectives)  
3. [Setup and dependencies](#setup-and-dependencies)  
   - [Installing dependencies](#installing-dependencies)  
   - [Setting up file paths and output directories](#file-paths-and-output-directories)    
4. [Metadata preparation](#metadata-preparation)  
   - [Setting up emotion maps and inclusion rules](#emotion-maps-and-inclusion-rules)
   - [Collecting all relevant WAV files from the datasets](#collecting-all-relevant-WAV-files)
   - [Extracting the features from a WAV file](#extracting-the-features)
5. [Exporting the files](#exporting-the-files)
   - [Exporting all the files in the dataframe](#exporting-all-the-files-in-the-dataframe)
6. [Summary and next steps](#summary)
7. [References](#references)

---

## 1. Introduction <a id="introduction"></a>

This notebook marks the first stage of a larger project on **Neural Style Transfer (NST) in Speech**, which explores the transformation of spoken audio to adopt the *emotions* of another sample. Inspired by visual style transfer techniques, the goal is to blend the *content* of one speech recording (e.g., neutral narration) with the *emotional tone* or *style* of another (e.g., happy, angry).

To enable this, we use the **RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song)** [[1](#reference-1)] and **CREMA-D (Crowd Sourced Emotional Multimodal Actors Dataset)** [[2](#reference-2)] datasets, which are well-suited for prosody-based audio style transfer because they:
- Contain professional-quality speech recordings by actors (24 in RAVDESS, 91 in CREMA-D)
- Cover a handful of emotions, with multiple intensity levels and varied lexical content
- Include clean, noise-free recordings suitable for signal processing
- Are publicly available for research use

In this notebook, we begin the preprocessing pipeline by loading, parsing, and converting the audio files from these two datasets into .npy files - a format suitable for training neural models.

---

## 2. Objectives <a id="objectives"></a>

- Load the RAVDESS and CREMA-D datasets
- Keep only the strong-intensity clips of each non-neutral emotion, plus the neutral clips (strongly expressed emotions should be easier for the models to tell apart)
- Standardize all audio files (e.g., resample to a common sample rate, peak-normalize, pad or truncate to a fixed length, and apply log compression)
- Extract spectral and prosody features from the audio files (e.g., log-mel spectrogram, MFCCs, pitch, energy)
- Save the outputs as .npy files in a structured folder format by emotion category

This preprocessing ensures that our dataset is:
- Clean, consistent, and normalized across all samples
- Structured to support emotion-specific modeling and style transfer
- Ready for downstream tasks such as classification and style transformation using neural networks

The outputs from this notebook form the foundation for the classification models and style transfer techniques developed in later stages.

---

## 3. Setup and dependencies <a id="setup-and-dependencies"></a>
### Installing dependencies <a id="installing-dependencies"></a>
In this cell, we install the necessary Python packages for:

- Audio processing and feature extraction (`librosa`)
- Displaying progress bars during loops (`tqdm`)
- Efficient numerical operations and array handling (`numpy`)
- Tabular handling of the file metadata (`pandas`)

In [1]:
!pip install librosa tqdm numpy pandas --quiet

### Setting up file paths and output directories <a id="file-paths-and-output-directories"></a>
We define the paths to the CREMA-D and RAVDESS datasets and to the directory that holds the temporary metadata CSV, creating that directory if it does not exist. The CSV lists all the relevant files that are later processed into .npy files.

In [2]:
import os

# define base paths
CREMAD_DATA_DIR = "datasets/CREMA-D"
RAVDESS_DATA_DIR = "datasets/RAVDESS/"
CSV_DIR = "csv/"

# create root output directory if it doesn't exist
os.makedirs(CSV_DIR, exist_ok=True)
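
Before going further, it can help to confirm that the dataset folders are actually where the paths say they are. The following is a minimal sketch, not part of the original pipeline; `find_missing_dirs` is a hypothetical helper, and the folder layout (an `AudioWAV` sub-folder for CREMA-D) is taken from the paths used later in this notebook.

```python
from pathlib import Path

def find_missing_dirs(paths):
    """Return the entries of `paths` that are not existing directories."""
    return [str(p) for p in map(Path, paths) if not p.is_dir()]

# Paths mirror the constants defined above (assumed layout).
missing = find_missing_dirs(["datasets/CREMA-D/AudioWAV", "datasets/RAVDESS"])
if missing:
    print("Missing dataset directories:", missing)
else:
    print("All dataset directories found.")
```

An empty result means both datasets are in place; anything else is worth fixing before the metadata cells run, since `os.listdir` would otherwise raise `FileNotFoundError`.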

---

## 4. Metadata preparation <a id="metadata-preparation"></a>
### Setting up emotion maps and inclusion rules<a id="emotion-maps-and-inclusion-rules"></a>

In [3]:
# -------- CREMA-D -------- #
# map CREMA-D emotion codes to labels
cremad_emotion_map = {
    "NEU": "neutral",
    "HAP": "happy",
    "SAD": "sad",
    "ANG": "angry",
    "FEA": "fearful",
    "DIS": "disgust"
}

# precompute valid (actor, statement) pairs that have at least one HI non-neutral clip
cremad_audio_path = os.path.join(CREMAD_DATA_DIR, "AudioWAV")
valid_pairs = set()

for fname in os.listdir(cremad_audio_path):
    if not fname.endswith(".wav"):
        continue
    parts = fname.split("_")
    if len(parts) != 4:
        continue
    actor, statement, emotion_code, intensity_with_ext = parts
    emotion_code = emotion_code.upper()
    intensity = intensity_with_ext.replace(".wav", "").upper()
    if emotion_code in cremad_emotion_map and emotion_code != "NEU" and intensity == "HI":
        valid_pairs.add((actor, statement))

# function to decide if a CREMA-D file should be included
def should_include_cremad_file(emotion_code, intensity, fname=None):
    if emotion_code not in cremad_emotion_map:  # make sure mapping exists
        return False

    if emotion_code == "NEU":
        # only include neutral files if they match a valid actor+statement pair
        if fname is None:
            return False
        parts = fname.split("_")
        if len(parts) != 4:
            return False
        actor, statement, _, _ = parts
        return (actor, statement) in valid_pairs and intensity == "XX"
    else:
        # for other emotions, only keep high intensity
        return intensity == "HI"


# -------- RAVDESS -------- #
# map RAVDESS emotion codes to labels
ravdess_emotion_map = {
    "01": "neutral",
    "03": "happy",
    "04": "sad",
    "05": "angry",
    "06": "fearful",
    "07": "disgust"
}

# function to decide if a RAVDESS file should be included
def should_include_ravdess_file(modality, channel, emotion, intensity):
    if modality != "03" or channel != "01":  # only speech, audio-only files
        return False
    if emotion not in ravdess_emotion_map:  # make sure mapping exists
        return False
    if emotion == "01" and intensity != "01":  # neutral must be normal
        return False
    if emotion != "01" and intensity != "02":  # others must be strong
        return False
    return True
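
Both inclusion rules depend on the datasets' filename conventions. As an illustration only, the sketch below splits one filename of each kind into named fields. The two parser functions are hypothetical helpers rather than part of the notebook, and the field names follow the variable names used in the cells here.

```python
import os

def parse_cremad_filename(fname):
    """Split a CREMA-D name such as 1001_DFA_ANG_HI.wav into its four fields."""
    stem, ext = os.path.splitext(fname)
    parts = stem.split("_")
    if ext.lower() != ".wav" or len(parts) != 4:
        return None  # not a CREMA-D style filename
    actor, statement, emotion_code, intensity = parts
    return {
        "actor": actor,
        "statement": statement,
        "emotion_code": emotion_code.upper(),
        "intensity": intensity.upper(),
    }

def parse_ravdess_filename(fname):
    """Split a RAVDESS name such as 03-01-05-02-01-01-12.wav into its seven fields."""
    stem, ext = os.path.splitext(fname)
    parts = stem.split("-")
    if ext.lower() != ".wav" or len(parts) != 7:
        return None  # not a RAVDESS style filename
    keys = ("modality", "vocal_channel", "emotion", "intensity",
            "statement", "repetition", "actor")
    return dict(zip(keys, parts))

print(parse_cremad_filename("1001_DFA_ANG_HI.wav"))
print(parse_ravdess_filename("03-01-05-02-01-01-12.wav"))
```

Splitting off the extension first keeps `.wav` out of the last field, which matters for RAVDESS because the actor ID is the final component of the name.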

### Collecting all relevant WAV files from the datasets <a id="collecting-all-relevant-WAV-files"></a>

In [4]:
import pandas as pd

metadata = []

# -------- CREMA-D -------- #
cremad_audio_path = os.path.join(CREMAD_DATA_DIR, "AudioWAV")

for file in os.listdir(cremad_audio_path):
    if not file.endswith(".wav"):
        continue

    # Example CREMA-D filename: 1001_DFA_ANG_HI.wav
    parts = file.split("_")
    if len(parts) != 4:
        continue

    actor, statement, emotion_code, intensity_with_ext = parts
    intensity = intensity_with_ext.replace(".wav", "")

    # pass the filename so neutral files can be checked against valid_pairs
    if should_include_cremad_file(emotion_code, intensity, fname=file):
        label = cremad_emotion_map[emotion_code]
        filepath = os.path.join(cremad_audio_path, file)
        metadata.append({
            "filepath": filepath,
            "dataset": "CREMA-D",
            "actor": actor,
            "emotion": label,
            "intensity": intensity
        })

# -------- RAVDESS -------- #
for actor_dir in os.listdir(RAVDESS_DATA_DIR):
    actor_path = os.path.join(RAVDESS_DATA_DIR, actor_dir)
    if not os.path.isdir(actor_path):
        continue

    for file in os.listdir(actor_path):
        if not file.endswith(".wav"):
            continue
            
        # example RAVDESS filename: 03-01-01-01-01-01-01.wav
        parts = file.split("-")
        if len(parts) != 7:
            continue

        modality, vocal_channel, emotion, intensity, statement, repetition, actor = parts
        actor = os.path.splitext(actor)[0]  # the last field still carries ".wav"; strip it

        if should_include_ravdess_file(modality, vocal_channel, emotion, intensity):
            label = ravdess_emotion_map[emotion]
            filepath = os.path.join(actor_path, file)
            metadata.append({
                "filepath": filepath,
                "dataset": "RAVDESS",
                "actor": actor,
                "emotion": label,
                "intensity": intensity
            })

# collect the metadata into a DataFrame
df = pd.DataFrame(metadata)
print(df.head())
print("\nTotal files collected:", len(df))

# save metadata for later use
df.to_csv(os.path.join(CSV_DIR, "metadata.csv"), index=False)

                                        filepath  dataset actor  emotion  \
0  datasets/CREMA-D\AudioWAV\1001_IEO_ANG_HI.wav  CREMA-D  1001    angry   
1  datasets/CREMA-D\AudioWAV\1001_IEO_DIS_HI.wav  CREMA-D  1001  disgust   
2  datasets/CREMA-D\AudioWAV\1001_IEO_FEA_HI.wav  CREMA-D  1001  fearful   
3  datasets/CREMA-D\AudioWAV\1001_IEO_HAP_HI.wav  CREMA-D  1001    happy   
4  datasets/CREMA-D\AudioWAV\1001_IEO_NEU_XX.wav  CREMA-D  1001  neutral   

  intensity  
0        HI  
1        HI  
2        HI  
3        HI  
4        XX  

Total files collected: 1122
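
A quick way to check the class balance is to count rows per dataset and emotion in the saved CSV. The sketch below is an assumption-laden illustration: `count_by_dataset_and_emotion` is a hypothetical helper, and the in-memory CSV contains made-up rows that only mimic the column layout of `metadata.csv`.

```python
import csv
import io
from collections import Counter

def count_by_dataset_and_emotion(csv_file):
    """Count rows per (dataset, emotion) pair in an open metadata CSV."""
    reader = csv.DictReader(csv_file)
    return Counter((row["dataset"], row["emotion"]) for row in reader)

# Tiny in-memory stand-in for csv/metadata.csv (made-up rows, same columns).
sample_csv = io.StringIO(
    "filepath,dataset,actor,emotion,intensity\n"
    "a.wav,CREMA-D,1001,angry,HI\n"
    "b.wav,CREMA-D,1001,neutral,XX\n"
    "c.wav,RAVDESS,01,angry,02\n"
    "d.wav,RAVDESS,02,angry,02\n"
)
counts = count_by_dataset_and_emotion(sample_csv)
for (dataset, emotion), n in sorted(counts.items()):
    print(f"{dataset:8s} {emotion:8s} {n}")
```

To run it on the real file, open it with `open(os.path.join(CSV_DIR, "metadata.csv"), newline="")` and pass the file object in.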


### Extracting the features from a WAV file <a id="extracting-the-features"></a>

In [5]:
import librosa
import numpy as np
import os
from tqdm import tqdm

# -----------------------------
# Settings
# -----------------------------
SAMPLE_RATE = 16000
N_MELS = 80
FIXED_FRAMES = 256  # standard length for time frames
N_FFT = 1024
HOP_LENGTH = 256
N_MFCC = 13  # number of MFCCs

# -----------------------------
# Utility functions
# -----------------------------
def normalize_audio(y):
    """Peak-normalize waveform to [-1, 1]; an all-zero waveform is returned unchanged."""
    peak = np.max(np.abs(y))
    return y / peak if peak > 0 else y

def pad_or_truncate_audio(y, target_samples):
    """Pad or truncate waveform to target number of samples."""
    if len(y) > target_samples:
        return y[:target_samples]
    elif len(y) < target_samples:
        return np.pad(y, (0, target_samples - len(y)), mode='constant')
    return y

def pad_or_truncate_spec(spec, target_frames=FIXED_FRAMES):
    """Pad or truncate 2D or 1D feature arrays to fixed time frames."""
    if spec.ndim == 2:
        if spec.shape[1] > target_frames:
            return spec[:, :target_frames]
        elif spec.shape[1] < target_frames:
            pad_width = target_frames - spec.shape[1]
            return np.pad(spec, ((0, 0), (0, pad_width)), mode='constant', constant_values=1e-6)
    elif spec.ndim == 1:
        if len(spec) > target_frames:
            return spec[:target_frames]
        elif len(spec) < target_frames:
            pad_width = target_frames - len(spec)
            return np.pad(spec, (0, pad_width), mode='constant', constant_values=1e-6)
    return spec

# -----------------------------
# Feature extraction functions
# -----------------------------
def extract_mel_spectrogram(y, sr, n_mels=N_MELS, target_frames=FIXED_FRAMES):
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=n_mels,
        n_fft=N_FFT, hop_length=HOP_LENGTH,
        power=1.0
    )
    mel = np.log(mel + 1e-9)
    mel = pad_or_truncate_spec(mel, target_frames)
    return mel.astype(np.float32)

def extract_mfcc(y, sr, n_mfcc=N_MFCC, target_frames=FIXED_FRAMES):
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=n_mfcc,
        n_fft=N_FFT, hop_length=HOP_LENGTH
    )
    # Compute delta and delta-delta
    mfcc_delta = librosa.feature.delta(mfcc)
    mfcc_delta2 = librosa.feature.delta(mfcc, order=2)
    mfcc_features = np.vstack([mfcc, mfcc_delta, mfcc_delta2])
    mfcc_features = pad_or_truncate_spec(mfcc_features, target_frames)
    return mfcc_features.astype(np.float32)

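def extract_pitch(y, sr, target_frames=FIXED_FRAMES):
    # pass sr explicitly (pyin otherwise assumes 22050 Hz) and reuse the shared
    # hop length so the pitch frames line up with the mel/MFCC frames
    f0, _, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz('C2'),
        fmax=librosa.note_to_hz('C7'),
        sr=sr, hop_length=HOP_LENGTH
    )
    f0 = np.nan_to_num(f0)  # unvoiced frames (NaN) become 0
    f0 = pad_or_truncate_spec(f0, target_frames)
    return f0.astype(np.float32)

def extract_energy(y, frame_length=2048, hop_length=HOP_LENGTH, target_frames=FIXED_FRAMES):
    # RMS energy per frame, using the shared hop length so it aligns with the other features
    energy = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length)[0]
    energy = pad_or_truncate_spec(energy, target_frames)
    return energy.astype(np.float32)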

def extract_spectral_features(y, sr, target_frames=FIXED_FRAMES):
    # Spectral centroid
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=N_FFT, hop_length=HOP_LENGTH)[0]
    centroid = pad_or_truncate_spec(centroid, target_frames)

    # Spectral bandwidth
    bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr, n_fft=N_FFT, hop_length=HOP_LENGTH)[0]
    bandwidth = pad_or_truncate_spec(bandwidth, target_frames)

    # Spectral contrast
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr, n_fft=N_FFT, hop_length=HOP_LENGTH)
    contrast = pad_or_truncate_spec(contrast, target_frames)

    spectral_feats = np.vstack([centroid, bandwidth, contrast])
    return spectral_feats.astype(np.float32)

def extract_duration(y, sr):
    return len(y) / sr

# -----------------------------
# Main processing function
# -----------------------------
def process_and_save_features(file_path, output_dir, emotion_label,
                              sample_rate=SAMPLE_RATE, n_mels=N_MELS, target_frames=FIXED_FRAMES):
    # Load and peak-normalize the waveform
    y, sr = librosa.load(file_path, sr=sample_rate)
    y = normalize_audio(y)

    # Measure the duration before padding/truncation so it reflects the original clip
    duration = extract_duration(y, sr)

    # Pad or truncate the waveform so every clip yields at least target_frames frames
    target_samples = (target_frames - 1) * HOP_LENGTH + N_FFT
    y = pad_or_truncate_audio(y, target_samples)

    # Extract features
    mel_db = extract_mel_spectrogram(y, sr, n_mels, target_frames)
    mfcc_feats = extract_mfcc(y, sr, N_MFCC, target_frames)
    f0 = extract_pitch(y, sr, target_frames)
    energy = extract_energy(y, target_frames=target_frames)
    spectral_feats = extract_spectral_features(y, sr, target_frames)

    # Save features (np.save pickles this dict; reload it with
    # np.load(path, allow_pickle=True).item())
    features = {
        "mel": mel_db,
        "mfcc": mfcc_feats,
        "f0": f0,
        "energy": energy,
        "spectral": spectral_feats,
        "duration": np.float32(duration),
        "sr": np.int32(sr),
        "emotion": emotion_label
    }

    out_dir = os.path.join(output_dir, emotion_label)
    os.makedirs(out_dir, exist_ok=True)
    file_id = os.path.splitext(os.path.basename(file_path))[0]
    out_path = os.path.join(out_dir, f"{file_id}.npy")
    np.save(out_path, features)
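
The fixed waveform length follows directly from the framing settings. As a worked example, the sketch below reproduces the arithmetic with the constants from the settings block. It assumes librosa's default centered framing, where a signal of `n` samples yields `1 + n // hop` frames; the two helper functions are hypothetical and exist only for this illustration.

```python
SAMPLE_RATE = 16000
N_FFT = 1024
HOP_LENGTH = 256
FIXED_FRAMES = 256

def target_samples(frames=FIXED_FRAMES, hop=HOP_LENGTH, n_fft=N_FFT):
    """Waveform length used by the pipeline: (frames - 1) * hop + n_fft."""
    return (frames - 1) * hop + n_fft

def centered_frame_count(n_samples, hop=HOP_LENGTH):
    """Frame count under centered framing (assumed librosa default)."""
    return 1 + n_samples // hop

n = target_samples()
print("samples per clip:", n)                        # 66304
print(f"seconds per clip: {n / SAMPLE_RATE:.3f}")    # 4.144
print("frames at hop 256:", centered_frame_count(n)) # 260
```

Since 260 exceeds `FIXED_FRAMES`, features computed at a hop of 256 are truncated to 256 frames rather than padded. A feature computed with a larger hop yields fewer frames for the same waveform (a hop of 512 gives 130), so it would be padded instead.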

---

## 5. Exporting the files <a id="exporting-the-files"></a>
### Exporting all the files in the dataframe<a id="exporting-all-the-files-in-the-dataframe"></a>

In [6]:
def preprocess_dataset(df, output_dir="npys"):
    """
    Given df with columns [filepath, emotion],
    extract and save features as .npy files.
    """
    for _, row in tqdm(df.iterrows(), total=len(df)):
        process_and_save_features(row["filepath"], output_dir, row["emotion"])

preprocess_dataset(df)

100%|██████████| 1122/1122 [12:02<00:00,  1.55it/s]
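
Because each .npy file stores a Python dict rather than a plain array, it has to be loaded with pickling enabled and unwrapped with `.item()`. The round trip below is a minimal sketch using dummy data; `load_features` is a hypothetical helper, and the dummy dict only imitates two of the keys written by `process_and_save_features`.

```python
import os
import tempfile

import numpy as np

def load_features(path):
    """Load a feature dict that was written with np.save(path, some_dict)."""
    return np.load(path, allow_pickle=True).item()

# Round trip with dummy features (shapes chosen to match N_MELS x FIXED_FRAMES).
demo = {"mel": np.zeros((80, 256), dtype=np.float32), "emotion": "happy"}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.npy")
    np.save(demo_path, demo)
    loaded = load_features(demo_path)

print(loaded["mel"].shape, loaded["emotion"])
```

Only enable `allow_pickle=True` for files you created yourself, since unpickling untrusted files can execute arbitrary code.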


---

## 6. Summary and next steps <a id="summary"></a>
### Summary
The relevant audio files have been exported as .npy files and organized into the folder:
- `npys/`

which contains six emotion sub-folders:
- `angry`
- `disgust`
- `fearful`
- `happy`
- `neutral`
- `sad`
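
To confirm that the export produced the expected layout, the files in each sub-folder can be counted. This is a small sketch under the assumption that the default output directory `npys` was used; `count_npy_per_emotion` is a hypothetical helper and not part of the notebook.

```python
from pathlib import Path

def count_npy_per_emotion(root):
    """Return {sub-folder name: number of .npy files} for each directory under root."""
    root = Path(root)
    return {
        d.name: len(list(d.glob("*.npy")))
        for d in sorted(root.iterdir())
        if d.is_dir()
    }

# Only report when the output folder exists (assumed default location).
if Path("npys").is_dir():
    for emotion, n_files in count_npy_per_emotion("npys").items():
        print(f"{emotion:8s} {n_files}")
```

The per-folder counts should add up to the number of rows in the metadata DataFrame.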

### Next steps
We will use these .npy files as inputs to train a neural network for emotion classification.
After that, we will use the prosody features stored in the .npy files (pitch, energy) to transform a neutral input toward a target emotion.

---

## 7. References <a id="references"></a>
[1] <a id="reference-1"></a> Affective Data Science Lab (ADSL), 2018. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [Online] 
Available at: https://zenodo.org/records/1188976
[Accessed 8 June 2025].


[2] <a id="reference-2"></a>  Cao, H., Cooper, D.G., Keutmann, M.K., Gur, R.C., Nenkova, A. & Verma, R., 2014. CREMA-D: Crowd-sourced Emotional Multimodal Actors Dataset. [Online] 
Available at: https://pmc.ncbi.nlm.nih.gov/articles/PMC4313618/
[Accessed 8 August 2025].