## Approach B

### Data path
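The dataset paths used in this notebook (for example `/Users/larsheijnen/DL/Train`) are local to the author's machine. Point them at your own data folders before running the cells.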

## Data (not augmented)

This section introduces the data handling for the project.
The code defines a custom PyTorch `Dataset` class, `AccentSpectrogramDataset`, which loads `.wav` files from a folder, resamples them to `target_sr` when needed,
and converts each waveform to a linear spectrogram or, with `use_mel=True`, a mel spectrogram, optionally log-scaled. Each item is returned together with the accent label and the gender, both parsed from the filename.
The dataset is meant for training or evaluating accent classifiers on spectrogram inputs.


In [1]:
import os
import torch
import torchaudio
import torchaudio.transforms as T
from torch.utils.data import Dataset

class AccentSpectrogramDataset(Dataset):
    def __init__(self, folder_path,
                 target_sr: int = 16000,
                 use_mel: bool = False,
                 n_fft: int = 400,
                 hop_length: int = None,
                 n_mels: int = 64,
                 log_scale: bool = True):
        # store file paths only; transform per item
        self.file_paths = [
            os.path.join(folder_path, f)
            for f in os.listdir(folder_path)
            if f.endswith('.wav')
        ]
        self.target_sr = target_sr
        self.use_mel = use_mel
        self.n_fft = n_fft
        self.hop_length = hop_length or n_fft // 2
        self.n_mels = n_mels
        self.log_scale = log_scale

        # Build the transform once and reuse it for every item
        if self.use_mel:
            self._transform = T.MelSpectrogram(
                sample_rate=self.target_sr,
                n_fft=self.n_fft,
                hop_length=self.hop_length,
                n_mels=self.n_mels
            )
        else:
            self._transform = T.Spectrogram(
                n_fft=self.n_fft,
                hop_length=self.hop_length
            )

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        path = self.file_paths[idx]
        waveform, sr = torchaudio.load(path)
        if sr != self.target_sr:
            waveform = T.Resample(sr, self.target_sr)(waveform)

        spec = self._transform(waveform)
        if self.log_scale:
            spec = torch.log(spec + 1e-6)

        fname = os.path.basename(path)
        accent = int(fname[0]) - 1          # classes 0–4
        gender = fname[1]  # 'm' or 'f' 
        return spec, accent, gender
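
The labels come entirely from the filename convention (first character is the accent digit, second is the gender letter). As a minimal sketch of that convention, the hypothetical helper `parse_labels` below mirrors the two indexing steps in `__getitem__` and adds a validity check that the original class does not perform. The check is an assumption about what a well-formed name looks like, based on names such as `2f_7399.wav` printed later in this notebook.

```python
import os


def parse_labels(path):
    """Split a file name such as '2f_7399.wav' into (accent, gender).

    Hypothetical helper: assumes the first character is a digit 1-5 and
    the second character is 'm' or 'f'.
    """
    fname = os.path.basename(path)
    if len(fname) < 2 or fname[0] not in "12345" or fname[1] not in ("m", "f"):
        raise ValueError(f"file name does not follow the <accent><gender>_ scheme: {fname}")
    accent = int(fname[0]) - 1  # zero-based class index, 0-4
    gender = fname[1]
    return accent, gender


print(parse_labels("/data/Train/2f_7399.wav"))  # (1, 'f')
```

A name such as `9430.wav` fails this check, which is a useful reminder that the labels returned for files outside the training naming scheme carry no meaning.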

### Pad_collate

This section defines a custom collate function called `pad_collate` for use with a PyTorch DataLoader.
The function takes a batch of tuples (each containing a spectrogram, accent label, and gender),
pads or crops each spectrogram in the batch to a fixed width (`target_width`, default 208),
and returns a batch of stacked spectrogram tensors, a tensor of accent labels, and a list of gender strings.
This ensures that all spectrograms in a batch have the same shape, which is required for efficient batching in PyTorch.


In [2]:
import torch
import torch.nn.functional as F


def pad_collate(batch, target_width=208):
    """Pad or crop every spectrogram to target_width frames and stack the batch."""
    specs, accents, genders = zip(*batch)
    padded_specs = []
    for s in specs:
        pad_amount = target_width - s.shape[-1]
        if pad_amount > 0:
            # zero-pad on the right of the time axis
            padded = F.pad(s, (0, pad_amount))
        else:
            # keep only the first target_width frames
            padded = s[..., :target_width]
        padded_specs.append(padded)
    return (
        torch.stack(padded_specs),
        torch.tensor(accents),
        list(genders)   # list of 'm'/'f' strings
    )
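
To make the pad-or-crop rule concrete, here is a pure-Python sketch on a single row of values. The helper name `pad_or_crop` is hypothetical and it works on plain lists rather than tensors. The sketch also computes two numbers worth keeping in mind: the value a silent bin takes after `log(spec + 1e-6)`, and the approximate audio duration that 208 frames cover with the training settings used later (`hop_length=256`, 16 kHz), assuming centered frames.

```python
import math


def pad_or_crop(row, target_width, pad_value=0.0):
    """Right-pad with pad_value or truncate so the result has target_width items."""
    if len(row) < target_width:
        return row + [pad_value] * (target_width - len(row))
    return row[:target_width]


print(pad_or_crop([1.0, 2.0, 3.0], 5))                       # [1.0, 2.0, 3.0, 0.0, 0.0]
print(pad_or_crop([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0], 5))   # [1.0, 2.0, 3.0, 4.0, 5.0]

# Value of a completely silent bin after log(spec + 1e-6)
silence_floor = math.log(1e-6)

# Seconds of audio spanned by 208 frames at hop_length=256 and 16 kHz
covered_seconds = (208 - 1) * 256 / 16000

print(round(silence_floor, 2), covered_seconds)
```

Because the spectrograms are log-scaled before collation, a padding value of 0 is far above the silence floor of about -13.8. Whether that matters for the model is an open question here; padding with the floor value instead would be one alternative to try.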


### Inspecting a sample

In [3]:
dataset = AccentSpectrogramDataset("/Users/larsheijnen/DL/Train")
print(f"Total samples: {len(dataset)}")

# Look at the shape and labels of one sample
x, y, z = dataset[6]
print(f"Spectrogram shape: {x.shape}")
print(f"Label: {y}")
print(f"Gender: {z}")

Total samples: 3166
Spectrogram shape: torch.Size([1, 201, 526])
Label: 1
Gender: m
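
The printed shape `[1, 201, 526]` can be checked by hand. The sketch below assumes the usual short-time Fourier transform conventions (one-sided spectrum, centered frames), under which the number of frequency bins is `n_fft // 2 + 1` and the number of frames is `1 + n_samples // hop_length`. The function name and the 105,000-sample clip length are illustrative assumptions; the real clip length is not shown in the notebook.

```python
def spectrogram_shape(n_samples, n_fft=400, hop_length=None, n_mels=None, channels=1):
    """Expected (channels, bins, frames) for a centered, one-sided spectrogram."""
    hop = hop_length or n_fft // 2          # same default as the dataset class
    n_bins = n_mels if n_mels is not None else n_fft // 2 + 1
    n_frames = 1 + n_samples // hop
    return (channels, n_bins, n_frames)


# Defaults used in the inspection cell: n_fft=400, hop_length=200
print(spectrogram_shape(105_000))                                           # (1, 201, 526)

# Settings used for training: mel spectrogram, n_fft=1024, hop_length=256, n_mels=64
print(spectrogram_shape(105_000, n_fft=1024, hop_length=256, n_mels=64))    # (1, 64, 411)
```

A clip of about 6.6 seconds at 16 kHz therefore gives 526 frames with the defaults, well beyond the 208 frames kept by `pad_collate`.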


### Data loader

This section sets up a PyTorch `DataLoader` that shuffles the dataset and yields batches of four samples (`batch_size=4`, chosen to keep memory use low). The custom `pad_collate` function pads or crops the spectrograms so all inputs in a batch share one shape. Each batch contains a stacked spectrogram tensor, a tensor of accent labels, and a list of gender labels, ready for training or evaluation.

In [4]:
from torch.utils.data import DataLoader

# Use batch_size=4 for low RAM, pin_memory is False for macOS/MPS
dataloader = DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=pad_collate, pin_memory=False)

# Fetch a single batch to check the shapes
for batch in dataloader:
    spectrograms, accents, gender = batch
    print(f"Spectrograms: {spectrograms.shape}")  # (B, 1, F, T)
    print(f"Accents: {accents}")                  # (B,)
    print(f"Gender: {gender}")
    break

Spectrograms: torch.Size([4, 1, 201, 208])
Accents: tensor([0, 1, 0, 1])
Gender: ['m', 'f', 'f', 'f']


### Models

#### Overview

This code defines six variations of a simple Convolutional Neural Network (CNN) in PyTorch for multi-class classification (default: 5 classes, one per accent). Each model uses three convolutional layers, an adaptive average-pooling layer that brings every feature map to a fixed 16×16 size, and a fully connected (linear) output layer. The variations explore the effects of batch normalization and dropout regularization on training stability and overfitting.

---

#### Model Structure

**Shared Core Design (all models):**

- **Convolutional Layers (`Conv2d`):**  
  Three stacked 2D convolutional layers with increasing channel depth (8 → 16 → 32), kernel size 3×3, and padding to keep feature map sizes consistent.

- **Activation (`ReLU`):**  
  Each convolution is followed by a ReLU activation for non-linearity.

- **Pooling (`AdaptiveAvgPool2d`):**  
  Shrinks feature maps to a fixed 16×16 size, allowing input spectrograms of varying dimensions.

- **Flattening:**  
  The pooled output is flattened to a 1D vector.

- **Fully Connected Layer (`Linear`):**  
  Maps extracted features to the desired number of classes.
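
As a rough size check for the shared core, the parameter count can be worked out by hand. The sketch below uses the standard formulas (weights plus one bias per output channel or unit) and the layer sizes listed above. The helper names are hypothetical.

```python
def conv2d_params(in_ch, out_ch, kernel=3):
    """Weights (in_ch * out_ch * k * k) plus one bias per output channel."""
    return in_ch * out_ch * kernel * kernel + out_ch


def linear_params(in_features, out_features):
    """Weights plus one bias per output unit."""
    return in_features * out_features + out_features


conv_total = conv2d_params(1, 8) + conv2d_params(8, 16) + conv2d_params(16, 32)
fc_total = linear_params(32 * 16 * 16, 5)
batchnorm_extra = 2 * (8 + 16 + 32)  # learnable scale and shift per channel

print(conv_total)              # 5888
print(fc_total)                # 40965
print(conv_total + fc_total)   # 46853
print(batchnorm_extra)         # 112
```

Most of the weights sit in the final linear layer, which is one reason dropout is placed directly in front of it in the dropout variants.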

---

#### Model Variants

1. **CNNBaseline**  
   The simplest model: just convolutions, activations, pooling, and a final linear layer.

2. **CNNBaseline_BatchNorm**  
   Adds batch normalization after each convolution and before the ReLU activation, which stabilizes and speeds up training by normalizing layer inputs.

3. **CNNBaseline_Dropout3**  
   Adds a dropout layer (dropout probability 0.3) before the final linear layer to randomly zero some activations during training, helping prevent overfitting.

4. **CNNBaseline_Dropout5**  
   Same as above but with a higher dropout rate (0.5) for stronger regularization.

5. **CNNBaseline_Dropout3_BatchNorm**  
   Combines both batch normalization after each convolution and dropout (0.3) before the final layer.

6. **CNNBaseline_Dropout5_BatchNorm**  
   Combines batch normalization with a higher dropout rate (0.5).

---

#### Forward Pass Flow (for all models)

1. **Input:**  
   Receives a batch of spectrograms (e.g., `[B, 1, F, T]`).

2. **Conv → [BatchNorm] → ReLU:**  
   Processes input through three convolutional layers with ReLU activations; the batch-normalization variants normalize each convolution output before the activation.

3. **Pooling:**  
   Reduces output to a fixed 16×16 feature map.

4. **Dropout (if included):**  
   Applies dropout regularization before classification.

5. **Flatten and Classify:**  
   Flattens the pooled feature map and passes it to the fully connected layer for prediction.
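
The flow above can be traced as tensor shapes without running PyTorch. This sketch assumes the standard convolution output-size formula `(size + 2*padding - kernel) // stride + 1` and the layer settings from the model code (3×3 kernels, padding 1, stride 1, pooling to 16×16). The function names are hypothetical.

```python
def conv_out(size, kernel=3, padding=1, stride=1):
    """Spatial size after one convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(batch, freq, time, num_classes=5):
    """List the tensor shape after each stage of the shared CNN design."""
    shapes = [("input", (batch, 1, freq, time))]
    for i, channels in enumerate((8, 16, 32), start=1):
        freq, time = conv_out(freq), conv_out(time)
        shapes.append((f"conv{i}", (batch, channels, freq, time)))
    shapes.append(("pool", (batch, 32, 16, 16)))
    shapes.append(("flatten", (batch, 32 * 16 * 16)))
    shapes.append(("fc", (batch, num_classes)))
    return shapes


for name, shape in trace_shapes(4, 201, 208):
    print(f"{name:8s} {shape}")
```

With padding 1 and a 3×3 kernel, each convolution leaves the frequency and time sizes unchanged, so only the adaptive pooling step changes the spatial dimensions.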

---

#### Why so many models?

Testing these variations helps determine which combination of normalization and regularization yields the best training stability and generalization for your particular data and task.


In [5]:
import torch.nn as nn
import torch.nn.functional as F

#Model 1 (baseline)
class CNNBaseline(nn.Module):
    def __init__(self, num_classes: int = 5):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
    
        self.pool = nn.AdaptiveAvgPool2d((16, 16))  
        self.fc = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = self.pool(x)
        x = x.view(x.size(0), -1)
        return self.fc(x)
    
#Model 2 (baseline + batch normalization)
class CNNBaseline_BatchNorm(nn.Module):
    def __init__(self, num_classes: int = 5):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.bn1   = nn.BatchNorm2d(8)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        self.bn2   = nn.BatchNorm2d(16)
        self.conv3 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.bn3   = nn.BatchNorm2d(32)
        
        self.pool = nn.AdaptiveAvgPool2d((16, 16))  
        self.fc = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):
        # BatchNorm sits between each convolution and its ReLU
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        x = self.pool(x)
        x = x.view(x.size(0), -1)
        return self.fc(x)
    
#Model 3 (baseline + dropout 0.3)
class CNNBaseline_Dropout3(nn.Module):
    def __init__(self, num_classes: int = 5, dropout_p: float = 0.3):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        
        self.pool = nn.AdaptiveAvgPool2d((16, 16))  
        self.dropout = nn.Dropout(dropout_p)
        self.fc = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = self.pool(x)
        x = self.dropout(x)  # only active in train mode
        x = x.view(x.size(0), -1)
        return self.fc(x)
    
#Model 4 (baseline + dropout 0.5)
class CNNBaseline_Dropout5(nn.Module):
    def __init__(self, num_classes: int = 5, dropout_p: float = 0.5):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        
        self.pool = nn.AdaptiveAvgPool2d((16, 16))  
        self.dropout = nn.Dropout(dropout_p) 
        self.fc = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = self.pool(x)
        x = self.dropout(x)  # only active in train mode
        x = x.view(x.size(0), -1)
        return self.fc(x)

#Model 5 (baseline + batch normalization + dropout 0.3)
class CNNBaseline_Dropout3_BatchNorm(nn.Module):
    def __init__(self, num_classes: int = 5, dropout_p: float = 0.3):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.bn1   = nn.BatchNorm2d(8)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        self.bn2   = nn.BatchNorm2d(16)
        self.conv3 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.bn3   = nn.BatchNorm2d(32)

        self.pool = nn.AdaptiveAvgPool2d((16, 16))  
        self.dropout = nn.Dropout(dropout_p) 
        self.fc = nn.Linear(32 * 16 * 16, num_classes)

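    def forward(self, x):
        # BatchNorm sits between each convolution and its ReLU
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        x = self.pool(x)
        x = self.dropout(x)  # only active in train mode
        x = x.view(x.size(0), -1)
        return self.fc(x)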

#Model 6 (baseline + batch normalization + dropout 0.5)
class CNNBaseline_Dropout5_BatchNorm(nn.Module):
    def __init__(self, num_classes: int = 5, dropout_p: float = 0.5):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.bn1   = nn.BatchNorm2d(8)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        self.bn2   = nn.BatchNorm2d(16)
        self.conv3 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.bn3   = nn.BatchNorm2d(32)

        self.pool = nn.AdaptiveAvgPool2d((16, 16))  
        self.dropout = nn.Dropout(dropout_p) 
        self.fc = nn.Linear(32 * 16 * 16, num_classes)

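    def forward(self, x):
        # BatchNorm sits between each convolution and its ReLU
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        x = self.pool(x)
        x = self.dropout(x)  # only active in train mode
        x = x.view(x.size(0), -1)
        return self.fc(x)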


### Models_dict

This code creates a dictionary, models_dict, that links string keys (like "Model1", "Model2", etc.) to each CNN model class defined earlier.

In [6]:
models_dict = {
    "Model1": CNNBaseline,
    "Model2": CNNBaseline_BatchNorm,
    "Model3": CNNBaseline_Dropout3,
    "Model4": CNNBaseline_Dropout5,
    "Model5": CNNBaseline_Dropout3_BatchNorm,
    "Model6": CNNBaseline_Dropout5_BatchNorm,
}

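### Training models, non-augmented, early stop

This script prepares the non-augmented accent spectrogram dataset, splits it 80/20 into training and test sets with a fixed seed, and wraps both in DataLoaders. It defines helpers that evaluate a model overall and per gender, then trains each CNN variant for up to 150 epochs with early stopping (patience 20, minimum improvement of 0.005 in test accuracy). Metrics are printed every epoch, the best and final weights are saved, and an overall classification report plus a gender breakdown are printed for each model. Because the held-out split drives early stopping, it acts as a validation set rather than an untouched test set.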
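
The early-stopping rule in this section's training loop (`patience`, `min_improvement`) can be isolated from the model code. The sketch below replays the same rule over a list of accuracies. The function name `early_stop_epoch` and the accuracy history are made up for illustration, and a patience of 3 is used only to keep the example short.

```python
def early_stop_epoch(accuracies, patience=20, min_improvement=0.005):
    """Return (epochs_run, best_acc) under the improve-or-count rule."""
    best_acc = 0.0
    counter = 0
    for epoch, acc in enumerate(accuracies, start=1):
        if acc > best_acc + min_improvement:
            # Counts as an improvement: remember it and reset the counter
            best_acc = acc
            counter = 0
        else:
            counter += 1
        if counter >= patience:
            return epoch, best_acc
    return len(accuracies), best_acc


history = [0.50, 0.60, 0.602, 0.601, 0.603, 0.61]  # illustrative values
print(early_stop_epoch(history, patience=3))  # (5, 0.6)
```

Note that 0.603 is higher than 0.60 but does not clear the 0.005 margin, so it neither resets the counter nor replaces the stored best accuracy. Training stops one epoch before the 0.61 result would have been seen.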

In [7]:
import torch
from torch.utils.data import DataLoader, random_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
import os

# Prepare dataset & split (non-augmented)
dataset = AccentSpectrogramDataset(
    '/Users/larsheijnen/DL/Train',
    target_sr=16000,
    use_mel=True,
    n_fft=1024,
    hop_length=256,
    n_mels=64,
    log_scale=True
)

train_len = int(0.8 * len(dataset))
test_len  = len(dataset) - train_len
train_ds, test_ds = random_split(dataset, [train_len, test_len], generator=torch.Generator().manual_seed(42))

train_loader = DataLoader(train_ds, batch_size=4, shuffle=True,  collate_fn=pad_collate)
test_loader  = DataLoader(test_ds,  batch_size=4, shuffle=False, collate_fn=pad_collate)

device    = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
criterion = nn.CrossEntropyLoss()

# General (not by gender) evaluation helper
def evaluate(loader, model, device):
    model.eval()
    all_preds, all_labels = [], []
    with torch.no_grad():
        for specs, labels, _ in loader:
            specs, labels = specs.to(device), labels.to(device)
            outputs = model(specs)
            preds = outputs.argmax(dim=1)
            all_preds.extend(preds.cpu().tolist())
            all_labels.extend(labels.cpu().tolist())
    acc    = accuracy_score(all_labels, all_preds)
    prec   = precision_score(all_labels, all_preds, average='macro', zero_division=0)
    recall = recall_score(all_labels, all_preds, average='macro', zero_division=0)
    f1     = f1_score(all_labels, all_preds, average='macro', zero_division=0)
    return acc, prec, recall, f1

# Gender-based evaluation helper
def evaluate_by_gender(loader, model, device):
    model.eval()
    all_preds, all_labels, all_genders = [], [], []
    with torch.no_grad():
        for specs, labels, genders in loader:
            specs, labels = specs.to(device), labels.to(device)
            outputs = model(specs)
            preds = outputs.argmax(dim=1)
            all_preds.extend(preds.cpu().tolist())
            all_labels.extend(labels.cpu().tolist())
            all_genders.extend(genders)
    results = {}
    for gender in ['m', 'f']:
        idxs = [i for i, g in enumerate(all_genders) if g == gender]
        gender_preds = [all_preds[i] for i in idxs]
        gender_labels = [all_labels[i] for i in idxs]
        acc = accuracy_score(gender_labels, gender_preds)
        prec = precision_score(gender_labels, gender_preds, average='macro', zero_division=0)
        recall = recall_score(gender_labels, gender_preds, average='macro')
        f1 = f1_score(gender_labels, gender_preds, average='macro')
        results[gender] = {'accuracy': acc, 'precision': prec, 'recall': recall, 'f1': f1}
    return results

def classification_report_for_model(model, loader, device):
    model.eval()
    all_preds, all_labels = [], []
    with torch.no_grad():
        for specs, labels, _ in loader:
            specs, labels = specs.to(device), labels.to(device)
            outputs = model(specs)
            preds = outputs.argmax(dim=1)
            all_preds.extend(preds.cpu().tolist())
            all_labels.extend(labels.cpu().tolist())
    print(classification_report(all_labels, all_preds, digits=3))

# Early stopping parameters
patience = 20
max_epochs = 150
min_improvement = 0.005  # 0.5%

save_dir_base = "/Users/larsheijnen/DL/saved_models/B/not_augmented_earlystop"
os.makedirs(save_dir_base, exist_ok=True)

for model_name, model_class in models_dict.items():
    model = model_class().to(device)
    print(f"\n=== Training model: {type(model).__name__} (Early Stopping, Not Augmented) ===")
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

    best_test_acc = 0.0
    patience_counter = 0

    best_model_path = os.path.join(save_dir_base, f"{type(model).__name__}_not_augmented_best_earlystop.pth")
    final_model_path = os.path.join(save_dir_base, f"{type(model).__name__}_not_augmented_final_earlystop.pth")

    for epoch in range(max_epochs):
        model.train()
        running_loss = 0.0
        for specs, labels, genders in train_loader:
            specs, labels = specs.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(specs), labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()  # sum of per-batch losses, not an average

        # Compute and print general metrics for this epoch (not by gender)
        train_acc, train_prec, train_recall, train_f1 = evaluate(train_loader, model, device)
        test_acc, test_prec, test_recall, test_f1 = evaluate(test_loader, model, device)
        print(
            f"Epoch {epoch+1:3d}/{max_epochs} | "
            f"Train Loss: {running_loss:.3f} | "
            f"Train Acc: {train_acc*100:5.2f}% | "
            f"Train Prec: {train_prec*100:5.2f}% | "
            f"Train Recall: {train_recall*100:5.2f}% | "
            f"Train F1: {train_f1*100:5.2f}% || "
            f"Test Acc: {test_acc*100:5.2f}% | "
            f"Test Prec: {test_prec*100:5.2f}% | "
            f"Test Recall: {test_recall*100:5.2f}% | "
            f"Test F1: {test_f1*100:5.2f}% | "
            f"Patience: {patience_counter}/{patience}"
        )

        # Early stopping logic
        if test_acc > best_test_acc + min_improvement:
            best_test_acc = test_acc
            patience_counter = 0
            torch.save(model.state_dict(), best_model_path)
            print(f"    → New best test accuracy: {best_test_acc*100:.3f}% (saved to {best_model_path})")
        else:
            patience_counter += 1

        if patience_counter >= patience:
            print(f"\nEarly stopping triggered for {type(model).__name__} after {epoch+1} epochs.")
            break

    # Save final model state
    torch.save(model.state_dict(), final_model_path)
    print(f"Final training state for {type(model).__name__} saved to {final_model_path}")

    print(f"\nTraining completed for {type(model).__name__} after {epoch+1} epochs.")
    print(f"Best test accuracy achieved during training: {best_test_acc*100:.3f}%")

    # Load the best model for final evaluation (if it was saved)
    if os.path.exists(best_model_path):
        print(f"\nLoading best saved model for {type(model).__name__} from {best_model_path} for final evaluation...")
        model.load_state_dict(torch.load(best_model_path, map_location=device))
        eval_model_description = "best saved"
    else:
        print(f"\nNo best model was saved for {type(model).__name__} (best accuracy did not sufficiently improve). Using final model state for evaluation.")
        eval_model_description = "final"

    print(f"\nClassification Report for {type(model).__name__} (using {eval_model_description} model):")
    classification_report_for_model(model, test_loader, device)

    print(f"\nGender breakdown for {type(model).__name__} (using {eval_model_description} model):")
    gender_results = evaluate_by_gender(test_loader, model, device)
    for gender in gender_results:
        label = "Male" if gender == "m" else "Female"
        print(f"{label}: {gender_results[gender]}")

    # Final summary for the current model
    final_train_acc_loaded_model, _, _, _ = evaluate(train_loader, model, device)
    print(f"\n--- Summary for {type(model).__name__} ---")
    print(f"- Total epochs trained: {epoch+1}")
    print(f"- Best validation accuracy during training: {best_test_acc*100:.3f}%")
    print(f"- Training accuracy of loaded ({eval_model_description}) model: {final_train_acc_loaded_model*100:.2f}%")
    if os.path.exists(best_model_path):
        print(f"- Best model saved to: {best_model_path}")
    else:
        print(f"- Best model not saved (or final model is the best achieved). Final model at: {final_model_path}")
    print(f"---------------------------------------\n")

print("\n\nAll model configurations have been trained and evaluated (non-augmented, early stopping).")


=== Training model: CNNBaseline (Early Stopping, Not Augmented) ===
Epoch   1/150 | Train Loss: 934.903 | Train Acc: 60.86% | Train Prec: 68.42% | Train Recall: 60.90% | Train F1: 58.68% || Test Acc: 53.31% | Test Prec: 61.30% | Test Recall: 53.43% | Test F1: 51.01% | Patience: 0/20
    → New best test accuracy: 53.312% (saved to /Users/larsheijnen/DL/saved_models/B/not_augmented_earlystop/CNNBaseline_not_augmented_best_earlystop.pth)
Epoch   2/150 | Train Loss: 566.557 | Train Acc: 78.12% | Train Prec: 80.79% | Train Recall: 78.12% | Train F1: 77.60% || Test Acc: 66.09% | Test Prec: 70.33% | Test Recall: 66.31% | Test F1: 65.77% | Patience: 0/20
    → New best test accuracy: 66.088% (saved to /Users/larsheijnen/DL/saved_models/B/not_augmented_earlystop/CNNBaseline_not_augmented_best_earlystop.pth)
Epoch   3/150 | Train Loss: 280.662 | Train Acc: 86.85% | Train Prec: 89.16% | Train Recall: 86.32% | Train F1: 86.27% || Test Acc: 76.50% | Test Prec: 80.66% | Test Recall: 76.28% | Test F

### Predicting accent on Test data using non-augmented models
This code loads a separate test dataset of accent spectrograms from its own folder, using the same preprocessing settings as training, and creates a `test_loader` that batches it without shuffling so predictions stay in file order. The test files (for example `9430.wav` in the output further down) do not appear to follow the training naming scheme, so the accent and gender values the dataset returns for them are not meaningful and are ignored during prediction.

In [8]:
testset_folder = "/Users/larsheijnen/DL/Test set"
test_dataset = AccentSpectrogramDataset(
    testset_folder,
    target_sr=16000,
    use_mel=True,
    n_fft=1024,
    hop_length=256,
    n_mels=64,
    log_scale=True
)
test_loader = DataLoader(test_dataset, batch_size=4, shuffle=False, collate_fn=pad_collate)

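This code builds the path to the directory containing saved model weights (`.pth` files) relative to the current working directory, lists the weight files in it, and maps each filename to its model class by checking the filename prefix, most specific class name first. Note that it reads from `saved_models/B/not_augmented`, which is a different folder from the `not_augmented_earlystop` directory written by the training cell above.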

In [9]:
import os
import torch

# Resolve the saved-models directory relative to the current working directory.
# abspath() of a bare filename is based on the cwd; use __file__ instead in a .py script.
base_dir = os.path.dirname(os.path.abspath('assignment_A.ipynb'))
saved_models_dir = os.path.join(base_dir, "saved_models", "B", "not_augmented")

# List all .pth files in the directory
model_files = [f for f in os.listdir(saved_models_dir) if f.endswith(".pth")]

# Map model file names to their classes. Longer class names are checked first,
# because every class name starts with "CNNBaseline".
model_classes = {}
for fname in model_files:
    if fname.startswith("CNNBaseline_Dropout3_BatchNorm"):
        model_classes[fname] = CNNBaseline_Dropout3_BatchNorm
    elif fname.startswith("CNNBaseline_Dropout5_BatchNorm"):
        model_classes[fname] = CNNBaseline_Dropout5_BatchNorm
    elif fname.startswith("CNNBaseline_Dropout3"):
        model_classes[fname] = CNNBaseline_Dropout3
    elif fname.startswith("CNNBaseline_Dropout5"):
        model_classes[fname] = CNNBaseline_Dropout5
    elif fname.startswith("CNNBaseline_BatchNorm"):
        model_classes[fname] = CNNBaseline_BatchNorm
    elif fname.startswith("CNNBaseline"):
        model_classes[fname] = CNNBaseline
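
The order of the `startswith` checks matters, since `"CNNBaseline"` is a prefix of every other class name. A compact alternative, sketched here with class names as plain strings, is to try candidates from longest to shortest. The helper name `match_class_name` is hypothetical and this is not the code the notebook ran.

```python
CLASS_NAMES = [
    "CNNBaseline",
    "CNNBaseline_BatchNorm",
    "CNNBaseline_Dropout3",
    "CNNBaseline_Dropout5",
    "CNNBaseline_Dropout3_BatchNorm",
    "CNNBaseline_Dropout5_BatchNorm",
]


def match_class_name(fname, names=CLASS_NAMES):
    """Return the longest class name that is a prefix of fname, or None."""
    for name in sorted(names, key=len, reverse=True):
        if fname.startswith(name):
            return name
    return None


print(match_class_name("CNNBaseline_Dropout5_BatchNorm_not_augmented_latest_2d.pth"))
print(match_class_name("CNNBaseline_not_augmented_best_earlystop.pth"))
```

Sorting by length removes the need to keep the `elif` chain in a hand-maintained order when new variants are added.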

This function predicts accent classes for each sample in the test set by passing batches through the model and collecting the predicted labels along with their corresponding filenames. It returns a list of tuples, where each tuple contains a filename and its predicted accent class.

In [10]:
def predict_accent_on_testset(model, test_loader, device):
    model.eval()
    all_preds = []
    all_fnames = []
    with torch.no_grad():
        for i, (specs, _, _) in enumerate(test_loader):  # gender is ignored
            specs = specs.to(device)
            outputs = model(specs)
            preds = outputs.argmax(dim=1).cpu().tolist()
            all_preds.extend(preds)
            # Get filenames for this batch (positions are valid only because shuffle=False)
            batch_indices = range(i * test_loader.batch_size, i * test_loader.batch_size + len(preds))
            fnames = [os.path.basename(test_dataset.file_paths[idx]) for idx in batch_indices]
            all_fnames.extend(fnames)
    return list(zip(all_fnames, all_preds))
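
The filename lookup above relies on index arithmetic: batch `i` of an unshuffled loader covers positions `i * batch_size` up to, but not including, `i * batch_size + len(preds)`. The sketch below replays that arithmetic on a made-up list of ten file names to show that the shorter last batch is handled correctly. All names in it are illustrative.

```python
def batch_positions(n_items, batch_size):
    """Yield (batch_number, positions) in the order an unshuffled loader visits them."""
    for i, start in enumerate(range(0, n_items, batch_size)):
        stop = min(start + batch_size, n_items)
        yield i, list(range(start, stop))


file_names = [f"clip_{k}.wav" for k in range(10)]  # hypothetical file list

recovered = []
for i, positions in batch_positions(len(file_names), batch_size=4):
    print(i, positions)
    recovered.extend(file_names[p] for p in positions)

print(recovered == file_names)
```

This mapping breaks as soon as the loader shuffles. Returning the file name from the dataset itself would be a sturdier design, at the cost of changing the tuple that `pad_collate` unpacks.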

This code iterates over each saved model file and its corresponding model class, loads the model weights onto the appropriate device, and sets the model to evaluation mode. It then runs predictions on the test set and prints the filename along with the predicted accent class for each sample.

In [11]:
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

for model_file, model_class in model_classes.items():
    model = model_class().to(device)
    model_path = os.path.join(saved_models_dir, model_file)
    model.load_state_dict(torch.load(model_path, map_location=device))
    model.eval()
    print(f"\nPredictions for model: {model_file}")
    results = predict_accent_on_testset(model, test_loader, device)
    for fname, pred in results:
        print(f"File: {fname} | Predicted Accent: {pred}")


Predictions for model: CNNBaseline_Dropout5_BatchNorm_not_augmented_latest_2d.pth
File: 9430.wav | Predicted Accent: 3
File: 4458.wav | Predicted Accent: 4
File: 1534.wav | Predicted Accent: 0
File: 8510.wav | Predicted Accent: 1
File: 7192.wav | Predicted Accent: 2
File: 2607.wav | Predicted Accent: 2
File: 1468.wav | Predicted Accent: 3
File: 5626.wav | Predicted Accent: 1
File: 9949.wav | Predicted Accent: 2
File: 5815.wav | Predicted Accent: 1
File: 6105.wav | Predicted Accent: 0
File: 4060.wav | Predicted Accent: 3
File: 4048.wav | Predicted Accent: 2
File: 8855.wav | Predicted Accent: 0
File: 7232.wav | Predicted Accent: 0
File: 8101.wav | Predicted Accent: 3
File: 8115.wav | Predicted Accent: 4
File: 7540.wav | Predicted Accent: 3
File: 8673.wav | Predicted Accent: 1
File: 2438.wav | Predicted Accent: 3
File: 9974.wav | Predicted Accent: 3
File: 7781.wav | Predicted Accent: 4
File: 8465.wav | Predicted Accent: 0
File: 9747.wav | Predicted Accent: 3
File: 8459.wav | Predicted Ac

## Check models on train data

This code creates a smaller, random subset of 100 samples from the full training dataset by selecting random indices and using PyTorch’s Subset class. It then prepares a DataLoader for this subset, allowing for efficient batching and iteration over just these selected samples.

In [12]:
from torch.utils.data import Subset
import numpy as np

trainset_folder = "/Users/larsheijnen/DL/Train"
full_train_dataset = AccentSpectrogramDataset(
    trainset_folder,
    target_sr=16000,
    use_mel=True,
    n_fft=1024,
    hop_length=256,
    n_mels=64,
    log_scale=True
)

# Randomly select 100 indices
np.random.seed(42)
subset_indices = np.random.choice(len(full_train_dataset), size=100, replace=False)
subset_dataset = Subset(full_train_dataset, subset_indices)
subset_loader = DataLoader(subset_dataset, batch_size=4, shuffle=False, collate_fn=pad_collate)

This function evaluates a model on a provided data loader containing a subset of the dataset, collecting the predicted and true labels for each sample, along with their filenames. It returns a list of tuples, where each tuple contains the filename, ground truth label, and predicted label, enabling detailed analysis of model performance on this specific subset.

In [13]:
def evaluate_on_subset(model, loader, device):
    model.eval()
    all_preds = []
    all_labels = []
    all_fnames = []
    with torch.no_grad():
        for i, (specs, labels, _) in enumerate(loader):  # ignore gender
            specs = specs.to(device)
            outputs = model(specs)
            preds = outputs.argmax(dim=1).cpu().tolist()
            all_preds.extend(preds)
            all_labels.extend(labels.tolist())
            # Get filenames for this batch
            batch_indices = range(i * loader.batch_size, i * loader.batch_size + len(preds))
            fnames = [os.path.basename(full_train_dataset.file_paths[idx]) for idx in subset_indices[batch_indices.start:batch_indices.stop]]
            all_fnames.extend(fnames)
    return list(zip(all_fnames, all_labels, all_preds))

This code loads each saved model, evaluates it on the selected 100-sample subset, and prints the true and predicted accent class (as one-based indices) for each file, marking whether each prediction is correct. After processing all samples, it calculates and prints the model’s overall accuracy on the subset.

In [14]:
# TODO: choose which model to use here

for model_file in model_files:
    model_class = model_classes[model_file]
    model = model_class().to(device)
    model_path = os.path.join(saved_models_dir, model_file)
    model.load_state_dict(torch.load(model_path, map_location=device))
    model.eval()
    print(f"\nEvaluation on subset for model: {model_file}")
    results = evaluate_on_subset(model, subset_loader, device)
    correct = 0
    for fname, true_label, pred_label in results:
        is_correct = true_label == pred_label
        correct += is_correct
        print(f"File: {fname} | True Accent: {true_label + 1} | Predicted Accent: {pred_label + 1} | {'✔️' if is_correct else '❌'}")
    print(f"Accuracy on subset: {correct/len(results)*100:.2f}%")


Evaluation on subset for model: CNNBaseline_Dropout5_BatchNorm_not_augmented_latest_2d.pth
File: 2f_7399.wav | True Accent: 2 | Predicted Accent: 2 | ✔️
File: 1m_5041.wav | True Accent: 1 | Predicted Accent: 1 | ✔️
File: 1f_4107.wav | True Accent: 1 | Predicted Accent: 1 | ✔️
File: 3m_3181.wav | True Accent: 3 | Predicted Accent: 3 | ✔️
File: 1m_8027.wav | True Accent: 1 | Predicted Accent: 2 | ❌
File: 3f_4283.wav | True Accent: 3 | Predicted Accent: 3 | ✔️
File: 2m_2504.wav | True Accent: 2 | Predicted Accent: 2 | ✔️
File: 3m_6518.wav | True Accent: 3 | Predicted Accent: 3 | ✔️
File: 4m_2067.wav | True Accent: 4 | Predicted Accent: 4 | ✔️
File: 1m_8195.wav | True Accent: 1 | Predicted Accent: 1 | ✔️
File: 5f_2432.wav | True Accent: 5 | Predicted Accent: 5 | ✔️
File: 3m_8721.wav | True Accent: 3 | Predicted Accent: 3 | ✔️
File: 1f_6268.wav | True Accent: 1 | Predicted Accent: 1 | ✔️
File: 4m_7425.wav | True Accent: 4 | Predicted Accent: 4 | ✔️
File: 1m_5430.wav | True Accent: 1 | Pred

## Data augmentation

This code defines a dataset class, `AccentSpectrogramDatasetAug`, which extends the original accent spectrogram dataset with random waveform augmentations: additive noise, time shifting, and volume scaling. Each time a sample is loaded, every augmentation is applied independently with probability 0.5 before the waveform is converted to a spectrogram.

In [2]:
import os
import torch
import torchaudio
import torchaudio.transforms as T  # needed for T.Resample below

class AccentSpectrogramDatasetAug(AccentSpectrogramDataset):
    def __init__(self, *args, noise_level=0.005, **kwargs):
        super().__init__(*args, **kwargs)
        self.noise_level = noise_level

    def add_noise(self, waveform, noise_level=None):
        if noise_level is None:
            noise_level = self.noise_level
        noise = torch.randn_like(waveform) * noise_level
        return waveform + noise

    def time_shift(self, waveform, shift_max=0.2):
        shift = int(waveform.size(1) * shift_max * (2 * torch.rand(1) - 1))
        return torch.roll(waveform, shifts=shift, dims=1)

    def random_volume(self, waveform, min_gain=0.8, max_gain=1.2):
        gain = torch.empty(1).uniform_(min_gain, max_gain)
        return waveform * gain

    def augment(self, waveform, sr):
        if torch.rand(1).item() < 0.5:
            waveform = self.add_noise(waveform)
        if torch.rand(1).item() < 0.5:
            waveform = self.time_shift(waveform)
        if torch.rand(1).item() < 0.5:
            waveform = self.random_volume(waveform)
        return waveform

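    def __getitem__(self, idx):
        path = self.file_paths[idx]
        waveform, sr = torchaudio.load(path)
        if sr != self.target_sr:
            waveform = T.Resample(sr, self.target_sr)(waveform)
            sr = self.target_sr  # keep sr in sync with the resampled waveform
        # Apply random augmentations to the (resampled) waveform
        waveform = self.augment(waveform, sr)
        spec = self._transform(waveform)
        if self.log_scale:
            # small offset avoids log(0)
            spec = torch.log(spec + 1e-6)
        # File names look like "2f_7399.wav": accent digit, then gender letter
        fname = os.path.basename(path)
        accent = int(fname[0]) - 1  # zero-based class index
        gender = fname[1]
        return spec, accent, gender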
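
The label extraction at the end of `__getitem__` can be illustrated on its own. The sketch below assumes file names follow the pattern seen in the evaluation output (a single accent digit, a gender letter, then an underscore and an id); the helper name `parse_labels` and the example directory are hypothetical and not part of the notebook.

```python
import os

def parse_labels(path):
    """Split a name like '2f_7399.wav' into a zero-based accent index and a gender code."""
    fname = os.path.basename(path)
    accent = int(fname[0]) - 1  # '1'..'5' becomes 0..4, the index range CrossEntropyLoss expects
    gender = fname[1]           # 'm' or 'f'
    return accent, gender

# Hypothetical path; only the file name matters here
print(parse_labels("/data/Train/2f_7399.wav"))  # (1, 'f')
```

This is why the evaluation printout adds 1 back to both the true and the predicted label before displaying them. The parsing only works while accent ids stay single-digit.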

This code defines a pad_collate function for use with a PyTorch DataLoader, which takes a batch of spectrogram samples and ensures all spectrograms have the same width (target_width, default 208) by padding or cropping as needed. The function returns a stacked tensor of padded spectrograms, a tensor of accent labels, and a list of gender labels, making it possible to efficiently batch and process variable-length audio data.

In [3]:
import torch
import torch.nn.functional as F

def pad_collate(batch, target_width=208):
    specs, accents, genders = zip(*batch)
    padded_specs = []
    for s in specs:
        pad_amount = target_width - s.shape[-1]
        if pad_amount > 0:
            # right-pad the time axis with zeros
            padded = F.pad(s, (0, pad_amount))
        else:
            # crop the time axis to the target width
            padded = s[..., :target_width]
        padded_specs.append(padded)
    return (
        torch.stack(padded_specs),
        torch.tensor(accents),
        list(genders),  # list of 'm'/'f' strings
    )
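
As a rough check of the pad-or-crop rule, the sketch below applies the same logic to two dummy spectrograms of different widths. The helper name `fit_width` and the tensor sizes are illustrative assumptions, not taken from the notebook.

```python
import torch
import torch.nn.functional as F

def fit_width(spec, target_width=208):
    """Right-pad with zeros or crop so the last (time) dimension equals target_width."""
    pad_amount = target_width - spec.shape[-1]
    if pad_amount > 0:
        return F.pad(spec, (0, pad_amount))
    return spec[..., :target_width]

short = torch.ones(1, 64, 150)  # narrower than the target: gets padded
long = torch.ones(1, 64, 526)   # wider than the target: gets cropped
batch = torch.stack([fit_width(short), fit_width(long)])
print(batch.shape)  # torch.Size([2, 1, 64, 208])
```

One thing to keep in mind: with `log_scale=True`, silence sits near log(1e-6) ≈ -13.8, so a padded value of 0 corresponds to a power of about 1 rather than to silence. Whether that affects the results is not something this notebook measures.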

### Inspecting the data

In [4]:
#dataset = AccentSpectrogramDataset("/Users/larsheijnen/DL/Train")
dataset = AccentSpectrogramDatasetAug("/Users/larsheijnen/DL/Train")
print(f"Total samples: {len(dataset)}")

# Look at shape of first spectrogram
x, y, z= dataset[6]
print(f"Spectrogram shape: {x.shape}")
print(f"Label: {y}")
print(f"Gender: {z}")

Total samples: 3166
Spectrogram shape: torch.Size([1, 201, 526])
Label: 1
Gender: m
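
The printed shape can be reproduced with a little arithmetic. The sketch below assumes the dataset defaults (`n_fft=400`, `hop_length=n_fft // 2`, linear spectrogram) and a centered STFT, which is the torchaudio default; the helper `spectrogram_shape` and the clip length of 105,000 samples are hypothetical values chosen to match the output above.

```python
def spectrogram_shape(num_samples, n_fft=400, hop_length=None):
    """Frequency bins and frame count of a centered STFT (even n_fft assumed)."""
    hop_length = hop_length or n_fft // 2
    n_freq = n_fft // 2 + 1                   # one-sided spectrum
    n_frames = 1 + num_samples // hop_length  # centered framing
    return n_freq, n_frames

print(spectrogram_shape(105_000))  # (201, 526)

# Duration covered by the default target_width=208 at hop 200 and 16 kHz
print((208 - 1) * 200 / 16_000)    # 2.5875
```

Under these assumptions, a width of 208 frames keeps roughly the first 2.6 seconds of each clip, so longer recordings such as the one inspected here are cropped by `pad_collate`.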


### Data loader

This section sets up a PyTorch DataLoader to efficiently batch and shuffle data, using a small batch_size and a custom pad_collate function to pad or crop spectrograms so all inputs in a batch have the same shape. The resulting batches contain stacked spectrogram tensors, accent labels, and gender labels, ready for use in training or evaluation.

In [5]:
from torch.utils.data import DataLoader

# Use batch_size=4 for low RAM, pin_memory is False for macOS/MPS
dataloader = DataLoader(dataset, batch_size=4, shuffle=True, collate_fn=pad_collate, pin_memory=False)

# Fetch one batch to check the shapes
for batch in dataloader:
    spectrograms, accents, gender = batch
    print(f"Spectrograms: {spectrograms.shape}")  # (B, 1, F, T)
    print(f"Accents: {accents}")                  # (B,)
    print(f"Gender: {gender}")
    break

Spectrograms: torch.Size([4, 1, 201, 208])
Accents: tensor([3, 0, 3, 3])
Gender: ['m', 'f', 'f', 'm']


### Models

#### Overview

This code defines six variations of a simple Convolutional Neural Network (CNN) architecture in PyTorch for multi-class accent classification from spectrograms (default: 5 classes). Each model uses three convolutional layers, an adaptive pooling layer to standardize feature map sizes, and a fully connected (linear) output layer. The variations explore the effects of batch normalization and dropout regularization on training stability and overfitting.

---

#### Model Structure

**Shared Core Design (all models):**

- **Convolutional Layers (`Conv2d`):**  
  Three stacked 2D convolutional layers with increasing channel depth (8 → 16 → 32), kernel size 3×3, and padding to keep feature map sizes consistent.

- **Activation (`ReLU`):**  
  Each convolution is followed by a ReLU activation for non-linearity.

- **Pooling (`AdaptiveAvgPool2d`):**  
  Shrinks feature maps to a fixed 16×16 size, allowing input spectrograms of varying dimensions.

- **Flattening:**  
  The pooled output is flattened to a 1D vector.

- **Fully Connected Layer (`Linear`):**  
  Maps extracted features to the desired number of classes.

---

#### Model Variants

1. **CNNBaseline**  
   The simplest model: just convolutions, activations, pooling, and a final linear layer.

2. **CNNBaseline_BatchNorm**  
   Adds batch normalization after each convolution (before the ReLU activation), which stabilizes and speeds up training by normalizing layer inputs.

3. **CNNBaseline_Dropout3**  
   Adds a dropout layer (dropout probability 0.3) before the final linear layer to randomly zero some activations during training, helping prevent overfitting.

4. **CNNBaseline_Dropout5**  
   Same as above but with a higher dropout rate (0.5) for stronger regularization.

5. **CNNBaseline_Dropout3_BatchNorm**  
   Combines both batch normalization after each convolution and dropout (0.3) before the final layer.

6. **CNNBaseline_Dropout5_BatchNorm**  
   Combines batch normalization with a higher dropout rate (0.5).

---

#### Forward Pass Flow (for all models)

1. **Input:**  
   Receives a batch of spectrograms (e.g., `[B, 1, F, T]`).

2. **Conv → [BatchNorm] → ReLU:**  
   Processes input through three convolutional layers with ReLU activations; the BatchNorm variants normalize each convolution's output before the activation.

3. **Pooling:**  
   Reduces output to a fixed 16×16 feature map.

4. **Dropout (if included):**  
   Applies dropout regularization before classification.

5. **Flatten and Classify:**  
   Flattens the pooled feature map and passes it to the fully connected layer for prediction.

---

#### Why so many models?

Testing these variations helps determine which combination of normalization and regularization yields the best training stability and generalization for your particular data and task.
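
To get a feel for the size of these models, the parameter counts can be worked out by hand. This is a back-of-the-envelope sketch based on the layer sizes listed above (3×3 convolutions with bias, 16×16 pooled maps, 5 classes); the helper names are hypothetical.

```python
def conv_params(c_in, c_out, k=3):
    """Weights plus one bias per output channel of a k x k convolution."""
    return (c_in * k * k + 1) * c_out

def linear_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return (n_in + 1) * n_out

channels = [1, 8, 16, 32]
conv_total = sum(conv_params(a, b) for a, b in zip(channels, channels[1:]))
fc_total = linear_params(32 * 16 * 16, 5)
bn_total = sum(2 * c for c in channels[1:])  # learnable scale and shift per channel

print(conv_total)             # 5888
print(fc_total)               # 40965
print(conv_total + fc_total)  # 46853
print(bn_total)               # 112
```

By this count most of the weights sit in the final linear layer, batch normalization adds only 112 learnable parameters, and dropout adds none.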


In [6]:
import torch.nn as nn
import torch.nn.functional as F

#Model 1 (baseline)
class CNNBaseline(nn.Module):
    def __init__(self, num_classes: int = 5):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
    
        self.pool = nn.AdaptiveAvgPool2d((16, 16))  
        self.fc = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = self.pool(x)
        x = x.view(x.size(0), -1)
        return self.fc(x)
    
#Model 2 (baseline + batch normalization)
class CNNBaseline_BatchNorm(nn.Module):
    def __init__(self, num_classes: int = 5):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.bn1   = nn.BatchNorm2d(8)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        self.bn2   = nn.BatchNorm2d(16)
        self.conv3 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.bn3   = nn.BatchNorm2d(32)
        
        self.pool = nn.AdaptiveAvgPool2d((16, 16))  
        self.fc = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):
        # conv -> batch norm -> ReLU for each block
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        x = self.pool(x)
        x = x.view(x.size(0), -1)
        return self.fc(x)
    
#Model 3 (baseline + dropout 0.3)
class CNNBaseline_Dropout3(nn.Module):
    def __init__(self, num_classes: int = 5, dropout_p: float = 0.3):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        
        self.pool = nn.AdaptiveAvgPool2d((16, 16))  
        self.dropout = nn.Dropout(dropout_p)
        self.fc = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = self.pool(x)
        x = self.dropout(x)  # only active in train mode
        x = x.view(x.size(0), -1)
        return self.fc(x)
    
#Model 4 (baseline + dropout 0.5)
class CNNBaseline_Dropout5(nn.Module):
    def __init__(self, num_classes: int = 5, dropout_p: float = 0.5):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        
        self.pool = nn.AdaptiveAvgPool2d((16, 16))  
        self.dropout = nn.Dropout(dropout_p) 
        self.fc = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = self.pool(x)
        x = self.dropout(x)  # only active in train mode
        x = x.view(x.size(0), -1)
        return self.fc(x)

#Model 5 (baseline + batch normalization + dropout 0.3)
class CNNBaseline_Dropout3_BatchNorm(nn.Module):
    def __init__(self, num_classes: int = 5, dropout_p: float = 0.3):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.bn1   = nn.BatchNorm2d(8)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        self.bn2   = nn.BatchNorm2d(16)
        self.conv3 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.bn3   = nn.BatchNorm2d(32)

        self.pool = nn.AdaptiveAvgPool2d((16, 16))  
        self.dropout = nn.Dropout(dropout_p) 
        self.fc = nn.Linear(32 * 16 * 16, num_classes)

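    def forward(self, x):
        # conv -> batch norm -> ReLU for each block
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        x = self.pool(x)
        x = self.dropout(x)  # only active in train mode
        x = x.view(x.size(0), -1)
        return self.fc(x)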

#Model 6 (baseline + batch normalization + dropout 0.5)
class CNNBaseline_Dropout5_BatchNorm(nn.Module):
    def __init__(self, num_classes: int = 5, dropout_p: float = 0.5):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.bn1   = nn.BatchNorm2d(8)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        self.bn2   = nn.BatchNorm2d(16)
        self.conv3 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.bn3   = nn.BatchNorm2d(32)

        self.pool = nn.AdaptiveAvgPool2d((16, 16))  
        self.dropout = nn.Dropout(dropout_p) 
        self.fc = nn.Linear(32 * 16 * 16, num_classes)

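    def forward(self, x):
        # conv -> batch norm -> ReLU for each block
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        x = self.pool(x)
        x = self.dropout(x)  # only active in train mode
        x = x.view(x.size(0), -1)
        return self.fc(x)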
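
As a minimal sketch of the forward pass flow described earlier (convolution, batch norm, ReLU, adaptive pooling, dropout, flatten, linear), the hypothetical `TinyCNN` below uses a single convolutional block instead of three. It is not one of the six models; it only illustrates that adaptive pooling lets the same network accept spectrograms of different sizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCNN(nn.Module):
    """Hypothetical one-block version of the shared design."""
    def __init__(self, num_classes=5, dropout_p=0.5):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(8)
        self.pool = nn.AdaptiveAvgPool2d((16, 16))
        self.dropout = nn.Dropout(dropout_p)
        self.fc = nn.Linear(8 * 16 * 16, num_classes)

    def forward(self, x):
        x = F.relu(self.bn(self.conv(x)))  # conv -> batch norm -> ReLU
        x = self.dropout(self.pool(x))     # fixed 16x16 map, then dropout
        return self.fc(x.flatten(1))       # flatten and classify

model = TinyCNN().eval()  # eval mode: dropout off, batch norm uses running stats
with torch.no_grad():
    a = model(torch.randn(2, 1, 64, 208))   # mel-sized input
    b = model(torch.randn(2, 1, 201, 526))  # linear-spectrogram-sized input
print(a.shape, b.shape)
```

Both calls return one row of five logits per sample, regardless of the input's frequency and time dimensions.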


### Models_dict

This code creates a dictionary, models_dict, that links string keys (like "Model1", "Model2", etc.) to each CNN model class defined earlier.

In [7]:
models_dict = {
    "Model1": CNNBaseline,
    "Model2": CNNBaseline_BatchNorm, 
    "Model3": CNNBaseline_Dropout3,
    "Model4": CNNBaseline_Dropout5,
    "Model5": CNNBaseline_Dropout3_BatchNorm,
    "Model6": CNNBaseline_Dropout5_BatchNorm,}

This script prepares a dataset of accent spectrograms, splits it into training and test sets, and wraps both in DataLoaders for batching. It trains each CNN model variant, prints metrics at each epoch, saves the trained models, and outputs both overall and gender-specific classification reports for each model. The evaluation helpers it calls (evaluate, evaluate_by_gender, classification_report_for_model) are defined in cell In [9], so those definitions must be executed before this training loop.

### Training models on augmented data with early stopping
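
The early stopping rule used in the training cells can be isolated into a few lines. The sketch below is a standalone illustration, assuming the same two settings (`patience` and `min_improvement`); the `EarlyStopper` class and the accuracy values are made up for the example and do not appear in the notebook.

```python
class EarlyStopper:
    """Stop once the metric has failed to beat the best value by
    min_improvement for `patience` consecutive epochs."""
    def __init__(self, patience=20, min_improvement=0.005):
        self.patience = patience
        self.min_improvement = min_improvement
        self.best = 0.0
        self.counter = 0

    def step(self, metric):
        """Return (improved, should_stop) for one epoch's metric."""
        if metric > self.best + self.min_improvement:
            self.best = metric
            self.counter = 0
            return True, False
        self.counter += 1
        return False, self.counter >= self.patience

stopper = EarlyStopper(patience=3, min_improvement=0.005)
history = [0.45, 0.56, 0.562, 0.561, 0.563]  # made-up test accuracies
stopped_at = None
for epoch, acc in enumerate(history, start=1):
    improved, stop = stopper.step(acc)
    if stop:
        stopped_at = epoch
        break

print(stopper.best, stopped_at)  # 0.56 5
```

Note that 0.562 and 0.563 are higher than 0.56 but fall short of the 0.005 margin, so they neither reset the counter nor replace the stored best value.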

In [8]:
import torch
import os
import torch.nn as nn
from torch.utils.data import DataLoader, random_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# --- Assuming the following are defined in previous cells ---
# - AccentSpectrogramDatasetAug class
# - pad_collate function
# - CNN model classes (CNNBaseline, CNNBaseline_BatchNorm, CNNBaseline_Dropout3, etc.)
# - models_dict (mapping model names to classes)
# - evaluate function
# - evaluate_by_gender function
# - classification_report_for_model function
# -------------------------------------------------------------

# 1. Prepare dataset & split (using Augmented Data)
dataset = AccentSpectrogramDatasetAug(
    '/Users/larsheijnen/DL/Train',  # Ensure this path is correct
    target_sr=16000,
    use_mel=True,
    n_fft=1024,
    hop_length=256,
    n_mels=64,
    log_scale=True
)

train_len = int(0.8 * len(dataset))
test_len  = len(dataset) - train_len
# Ensure consistent splitting for comparability if re-running
train_ds, test_ds = random_split(dataset, [train_len, test_len], generator=torch.Generator().manual_seed(42))

# DataLoaders
train_loader = DataLoader(train_ds, batch_size=4, shuffle=True, collate_fn=pad_collate, pin_memory=False)
test_loader  = DataLoader(test_ds,  batch_size=4, shuffle=False, collate_fn=pad_collate, pin_memory=False)

# Setup device and loss function
device    = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
criterion = nn.CrossEntropyLoss()

# Early stopping parameters
patience = 20  # Number of epochs to wait for improvement before stopping
max_epochs = 150 # Maximum number of epochs to train for
min_improvement = 0.005  # Minimum improvement in test accuracy to be considered significant (0.5%)

# Directory for saving models from this experiment
save_dir_base = "/Users/larsheijnen/DL/saved_models/B/augmented_earlystop_all_models" # New subdir
os.makedirs(save_dir_base, exist_ok=True)

print(f"Using device: {device}")
print(f"Number of training samples: {len(train_ds)}")
print(f"Number of test samples: {len(test_ds)}")

# Loop through each model configuration defined in models_dict
for model_key_name, model_class in models_dict.items():
    current_model_name = model_class.__name__ # e.g., "CNNBaseline_Dropout3"
    
    print(f"\n\n=== Training {current_model_name} with Early Stopping (Augmented Data) ===")
    
    model = model_class().to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

    best_test_acc_for_current_model = 0.0
    patience_counter_for_current_model = 0
    
    # Paths for saving this specific model
    best_model_path = os.path.join(save_dir_base, f"{current_model_name}_best.pth")
    final_model_path = os.path.join(save_dir_base, f"{current_model_name}_final_training.pth")

    # train_acc_history = [] # Optional: for plotting learning curves later
    # test_acc_history = []  # Optional

    for epoch in range(max_epochs):
        model.train()
        running_loss = 0.0
        for specs, labels, _ in train_loader: # Gender is not used in the loss calculation
            specs, labels = specs.to(device), labels.to(device)
            
            optimizer.zero_grad()
            outputs = model(specs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()

        avg_epoch_loss = running_loss / len(train_loader)
        
        # Evaluate performance at the end of the epoch
        # We only need train_acc for printing here, full metrics for test
        train_acc, _, _, _ = evaluate(train_loader, model, device) 
        test_acc, _, _, test_f1 = evaluate(test_loader, model, device)
        
        # train_acc_history.append(train_acc)
        # test_acc_history.append(test_acc)
        
        print(
            f"Epoch {epoch+1:3d}/{max_epochs} | Model: {current_model_name} | "
            f"Loss: {avg_epoch_loss:.4f} | "
            f"Train Acc: {train_acc*100:5.2f}% | "
            f"Test Acc: {test_acc*100:5.2f}% | Test F1: {test_f1*100:5.2f}% | "
            f"Patience: {patience_counter_for_current_model}/{patience}"
        )
        
        # Check for improvement
        if test_acc > best_test_acc_for_current_model + min_improvement:
            best_test_acc_for_current_model = test_acc
            patience_counter_for_current_model = 0
            torch.save(model.state_dict(), best_model_path)
            print(f"    → New best test accuracy for {current_model_name}: {best_test_acc_for_current_model*100:.3f}% (saved to {best_model_path})")
        else:
            patience_counter_for_current_model += 1
            
        # Early stopping check
        if patience_counter_for_current_model >= patience:
            print(f"\nEarly stopping triggered for {current_model_name} after {epoch+1} epochs.")
            break
    
    # Save final model state (this is the state at the end of training, might not be the best)
    torch.save(model.state_dict(), final_model_path)
    print(f"Final training state for {current_model_name} saved to {final_model_path}")

    print(f"\nTraining completed for {current_model_name} after {epoch+1} epochs.")
    print(f"Best test accuracy achieved during training for {current_model_name}: {best_test_acc_for_current_model*100:.3f}%")

    # Load the best model for final evaluation (if it was saved)
    if os.path.exists(best_model_path):
        print(f"\nLoading best saved model for {current_model_name} from {best_model_path} for final evaluation...")
        model.load_state_dict(torch.load(best_model_path, map_location=device))
        eval_model_description = "best saved"
    else:
        print(f"\nNo best model was saved for {current_model_name} (best accuracy did not sufficiently improve). Using final model state for evaluation.")
        # model is already in its final state, or you could explicitly load final_model_path
        # model.load_state_dict(torch.load(final_model_path, map_location=device)) 
        eval_model_description = "final"

    print(f"\nFinal Classification Report for {current_model_name} (using {eval_model_description} model):")
    classification_report_for_model(model, test_loader, device)

    print(f"\nFinal Gender breakdown for {current_model_name} (using {eval_model_description} model):")
    gender_results = evaluate_by_gender(test_loader, model, device)
    for gender_key_code in gender_results: # gender_key_code will be 'm' or 'f'
        gender_label = "Male" if gender_key_code == "m" else "Female"
        metrics_dict = gender_results[gender_key_code]
        metrics_str = ", ".join([f"{k.capitalize()}: {v*100:.2f}%" if isinstance(v, float) else f"{k.capitalize()}: {v}" for k,v in metrics_dict.items()])
        print(f"  {gender_label}: {metrics_str}")
        
    # Final summary for the current model
    # Re-evaluate the loaded model (best or final) on the training set for accurate final train accuracy
    final_train_acc_loaded_model, _, _, _ = evaluate(train_loader, model, device)
    print(f"\n--- Summary for {current_model_name} ---")
    print(f"- Total epochs trained: {epoch+1}")
    print(f"- Best validation accuracy during training: {best_test_acc_for_current_model*100:.3f}%")
    print(f"- Training accuracy of loaded ({eval_model_description}) model: {final_train_acc_loaded_model*100:.2f}%")
    if os.path.exists(best_model_path):
        print(f"- Best model saved to: {best_model_path}")
    else:
        print(f"- Best model not saved (or final model is the best achieved). Final model at: {final_model_path}")
    print(f"---------------------------------------\n")

print("\n\nAll model configurations have been trained and evaluated.")


=== Training Augmented model: CNNBaseline ===
Epoch  1 | Train Loss: 973.139 | Train Acc: 44.98% | Train Prec: 45.75% | Train Recall: 42.51% | Train F1: 40.65% || Test Acc: 43.69% | Test Prec: 44.59% | Test Recall: 41.90% | Test F1: 39.46%
Epoch  2 | Train Loss: 766.049 | Train Acc: 58.25% | Train Prec: 61.33% | Train Recall: 54.62% | Train F1: 53.41% || Test Acc: 53.63% | Test Prec: 59.41% | Test Recall: 51.27% | Test F1: 48.80%
Epoch  3 | Train Loss: 607.585 | Train Acc: 68.52% | Train Prec: 68.51% | Train Recall: 66.62% | Train F1: 67.12% || Test Acc: 64.35% | Test Prec: 64.83% | Test Recall: 62.80% | Test F1: 62.92%
Epoch  4 | Train Loss: 480.321 | Train Acc: 70.26% | Train Prec: 72.15% | Train Recall: 67.04% | Train F1: 67.01% || Test Acc: 61.99% | Test Prec: 65.19% | Test Recall: 59.94% | Test F1: 58.63%
Epoch  5 | Train Loss: 404.411 | Train Acc: 80.69% | Train Prec: 81.09% | Train Recall: 80.22% | Train F1: 80.12% || Test Acc: 71.77% | Test Prec: 73.25% | Test Recall: 71.00% |
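
The precision, recall, and F1 columns in this log are macro averages, since the `evaluate` helper passes `average='macro'` to scikit-learn. As a sketch of what that means, the pure-Python function below computes the same quantities by hand; the function name `macro_scores` and the toy labels are hypothetical.

```python
def macro_scores(y_true, y_pred):
    """Unweighted mean of per-class precision, recall, and F1 (0 where undefined)."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy labels: class 2 appears once and is never predicted
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0]
prec, rec, f1 = macro_scores(y_true, y_pred)
print(round(prec, 3), round(rec, 3), round(f1, 3))  # 0.472 0.583 0.517
```

In this toy case accuracy is 5/7 ≈ 71%, yet macro F1 is about 52%, because the missed minority class counts as much as the others. That is one reason the F1 column in the log can trail the accuracy column.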

In [9]:
import torch
import os
import torch.nn as nn  # needed for nn.CrossEntropyLoss below
from torch.utils.data import DataLoader, random_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# Prepare dataset & split
dataset = AccentSpectrogramDatasetAug(
    '/Users/larsheijnen/DL/Train',
    target_sr=16000,
    use_mel=True,
    n_fft=1024,
    hop_length=256,
    n_mels=64,
    log_scale=True)

train_len = int(0.8 * len(dataset))
test_len  = len(dataset) - train_len
train_ds, test_ds = random_split(dataset, [train_len, test_len], generator=torch.Generator().manual_seed(42))

train_loader = DataLoader(
    train_ds, batch_size=4, shuffle=True, collate_fn=pad_collate)
test_loader = DataLoader(
    test_ds, batch_size=4, shuffle=False, collate_fn=pad_collate)

device    = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
criterion = nn.CrossEntropyLoss()

# General (not by gender) evaluation helper
def evaluate(loader, model, device):
    model.eval()
    all_preds, all_labels = [], []
    with torch.no_grad():
        for specs, labels, _ in loader:
            specs, labels = specs.to(device), labels.to(device)
            outputs = model(specs)
            preds = outputs.argmax(dim=1)
            all_preds.extend(preds.cpu().tolist())
            all_labels.extend(labels.cpu().tolist())
    acc    = accuracy_score(all_labels, all_preds)
    # macro average: every accent class counts equally, regardless of its size
    prec   = precision_score(all_labels, all_preds, average='macro', zero_division=0)
    recall = recall_score(all_labels, all_preds, average='macro', zero_division=0)
    f1     = f1_score(all_labels, all_preds, average='macro', zero_division=0)
    return acc, prec, recall, f1

# Gender-based evaluation helper
def evaluate_by_gender(loader, model, device):
    model.eval()
    all_preds, all_labels, all_genders = [], [], []
    with torch.no_grad():
        for specs, labels, genders in loader:
            specs, labels = specs.to(device), labels.to(device)
            outputs = model(specs)
            preds = outputs.argmax(dim=1)
            all_preds.extend(preds.cpu().tolist())
            all_labels.extend(labels.cpu().tolist())
            all_genders.extend(genders)
    results = {}
    for gender in ['m', 'f']:
        idxs = [i for i, g in enumerate(all_genders) if g == gender]
        gender_preds = [all_preds[i] for i in idxs]
        gender_labels = [all_labels[i] for i in idxs]
        acc = accuracy_score(gender_labels, gender_preds)
        prec = precision_score(gender_labels, gender_preds, average='macro', zero_division=0)
        recall = recall_score(gender_labels, gender_preds, average='macro')
        f1 = f1_score(gender_labels, gender_preds, average='macro')
        results[gender] = {'accuracy': acc, 'precision': prec, 'recall': recall, 'f1': f1}
    return results

def classification_report_for_model(model, loader, device):
    model.eval()
    all_preds, all_labels = [], []
    with torch.no_grad():
        for specs, labels, _ in loader:
            specs, labels = specs.to(device), labels.to(device)
            outputs = model(specs)
            preds = outputs.argmax(dim=1)
            all_preds.extend(preds.cpu().tolist())
            all_labels.extend(labels.cpu().tolist())
    print(classification_report(all_labels, all_preds, digits=3))

# Train only CNNBaseline_Dropout3 with early stopping
model = CNNBaseline_Dropout3().to(device)
print(f"\n=== Training CNNBaseline_Dropout3 with Early Stopping (Augmented Data) ===")
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

# Early stopping parameters optimized for audio classification
best_test_acc = 0.0
patience = 20  # Higher patience for audio data
patience_counter = 0
max_epochs = 150  # Allow longer training
min_improvement = 0.005  # Require meaningful improvement (0.5%)

# Track training history
train_acc_history = []
test_acc_history = []

for epoch in range(max_epochs):
    model.train()
    running_loss = 0.0
    for specs, labels, genders in train_loader:
        specs, labels = specs.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(specs), labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()

    # Evaluate performance
    train_acc, train_prec, train_recall, train_f1 = evaluate(train_loader, model, device)
    test_acc, test_prec, test_recall, test_f1 = evaluate(test_loader, model, device)
    
    train_acc_history.append(train_acc)
    test_acc_history.append(test_acc)
    
    print(
        f"Epoch {epoch+1:3d} | "
        f"Loss: {running_loss:.3f} | "
        f"Train Acc: {train_acc*100:5.2f}% | "
        f"Test Acc: {test_acc*100:5.2f}% | "
        f"Test F1: {test_f1*100:5.2f}% | "
        f"Patience: {patience_counter}/{patience}"
    )
    
    # Check for improvement
    if test_acc > best_test_acc + min_improvement:
        best_test_acc = test_acc
        patience_counter = 0
        # Save best model
        os.makedirs("/Users/larsheijnen/DL/saved_models/B/augmented", exist_ok=True)
        torch.save(
            model.state_dict(),
            f"/Users/larsheijnen/DL/saved_models/B/augmented/CNNBaseline_Dropout3_augmented_best_final.pth"
        )
        print(f"    → New best test accuracy: {best_test_acc*100:.3f}% (saved)")
    else:
        patience_counter += 1
        
    # Early stopping check
    if patience_counter >= patience:
        print(f"\nEarly stopping triggered after {epoch+1} epochs")
        print(f"Best test accuracy: {best_test_acc*100:.3f}%")
        break

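# Save final model state (create the directory in case no best checkpoint was written)
os.makedirs("/Users/larsheijnen/DL/saved_models/B/augmented", exist_ok=True)
torch.save(
    model.state_dict(),
    "/Users/larsheijnen/DL/saved_models/B/augmented/CNNBaseline_Dropout3_augmented_final_training.pth"
)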

print(f"\nTraining completed after {epoch+1} epochs")
print(f"Best test accuracy achieved: {best_test_acc*100:.3f}%")

# Load best model for final evaluation
model.load_state_dict(torch.load("/Users/larsheijnen/DL/saved_models/B/augmented/CNNBaseline_Dropout3_augmented_best_final.pth", map_location=device))

print(f"\nFinal Classification Report for CNNBaseline_Dropout3:")
classification_report_for_model(model, test_loader, device)

print(f"\nFinal Gender breakdown for CNNBaseline_Dropout3:")
gender_results = evaluate_by_gender(test_loader, model, device)
for gender in gender_results:
    label = "Male" if gender == "m" else "Female"
    print(f"{label}: {gender_results[gender]}")

print(f"\nModel training summary:")
print(f"- Total epochs trained: {epoch+1}")
print(f"- Best validation accuracy: {best_test_acc*100:.3f}%")
print(f"- Final training accuracy: {train_acc*100:.2f}%")
print(f"- Model saved to: /Users/larsheijnen/DL/saved_models/B/augmented/CNNBaseline_Dropout3_augmented_best_final.pth")


=== Training CNNBaseline_Dropout3 with Early Stopping (Augmented Data) ===
Epoch   1 | Loss: 971.528 | Train Acc: 48.82% | Test Acc: 45.43% | Test F1: 43.09% | Patience: 0/20
    → New best test accuracy: 45.426% (saved)
Epoch   2 | Loss: 739.082 | Train Acc: 60.23% | Test Acc: 56.31% | Test F1: 51.79% | Patience: 0/20
    → New best test accuracy: 56.309% (saved)
Epoch   3 | Loss: 586.645 | Train Acc: 72.16% | Test Acc: 66.40% | Test F1: 64.85% | Patience: 0/20
    → New best test accuracy: 66.404% (saved)
Epoch   4 | Loss: 482.887 | Train Acc: 76.11% | Test Acc: 68.14% | Test F1: 65.62% | Patience: 0/20
    → New best test accuracy: 68.139% (saved)
Epoch   5 | Loss: 416.314 | Train Acc: 78.40% | Test Acc: 70.19% | Test F1: 68.56% | Patience: 0/20
    → New best test accuracy: 70.189% (saved)
Epoch   6 | Loss: 338.008 | Train Acc: 81.56% | Test Acc: 75.87% | Test F1: 75.82% | Patience: 0/20
    → New best test accuracy: 75.868% (saved)
Epoch   7 | Loss: 311.677 | Train Acc: 84.16% | 
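
The `Loss` column in this log is `running_loss`, the sum of per-batch losses over the epoch, not a mean. The sketch below converts the first epoch's value to a per-batch average, assuming the 3166 samples reported when the dataset was inspected, the 80/20 split, and `batch_size=4`; the variable names are illustrative.

```python
import math

total_samples = 3166  # from the dataset inspection output
batch_size = 4
train_len = int(0.8 * total_samples)
test_len = total_samples - train_len
n_batches = math.ceil(train_len / batch_size)

summed_loss = 971.528  # epoch 1 value from the log above
mean_loss = summed_loss / n_batches

print(train_len, test_len, n_batches)  # 2532 634 633
print(round(mean_loss, 3))             # 1.535
print(round(math.log(5), 3))           # 1.609, cross-entropy of a uniform 5-class guess
```

Under these assumptions the first epoch's average loss is just below the level of a uniform guess over five classes. The split sizes also put `min_improvement=0.005` in context: with 634 test samples, three additional correct predictions (about 0.47%) are not enough to count as an improvement, while four (about 0.63%) are.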

## Augmented Data and Training with early stopping

In [10]:
import torch
import os
import torch.nn as nn
from torch.utils.data import DataLoader, random_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# --- Assuming the following are defined in previous cells ---
# - AccentSpectrogramDatasetAug class
# - pad_collate function
# - CNN model classes (CNNBaseline, CNNBaseline_BatchNorm, CNNBaseline_Dropout3, etc.)
# - models_dict (mapping model names to classes)
# - evaluate function
# - evaluate_by_gender function
# - classification_report_for_model function
# -------------------------------------------------------------

# 1. Prepare dataset & split (using Augmented Data)
dataset = AccentSpectrogramDatasetAug(
    '/Users/larsheijnen/DL/Train',  # Ensure this path is correct
    target_sr=16000,
    use_mel=True,
    n_fft=1024,
    hop_length=256,
    n_mels=64,
    log_scale=True
)

train_len = int(0.8 * len(dataset))
test_len  = len(dataset) - train_len
# Ensure consistent splitting for comparability if re-running
train_ds, test_ds = random_split(dataset, [train_len, test_len], generator=torch.Generator().manual_seed(42))

# DataLoaders
train_loader = DataLoader(train_ds, batch_size=4, shuffle=True, collate_fn=pad_collate, pin_memory=False)
test_loader  = DataLoader(test_ds,  batch_size=4, shuffle=False, collate_fn=pad_collate, pin_memory=False)

# Setup device and loss function
device    = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
criterion = nn.CrossEntropyLoss()

# Early stopping parameters
patience = 20  # Number of epochs to wait for improvement before stopping
max_epochs = 150 # Maximum number of epochs to train for
min_improvement = 0.005  # Minimum improvement in test accuracy to be considered significant (0.5%)

# Directory for saving models from this experiment
save_dir_base = "/Users/larsheijnen/DL/saved_models/B/augmented_earlystop_all_models" # New subdir
os.makedirs(save_dir_base, exist_ok=True)

print(f"Using device: {device}")
print(f"Number of training samples: {len(train_ds)}")
print(f"Number of test samples: {len(test_ds)}")

# Loop through each model configuration defined in models_dict
for model_key_name, model_class in models_dict.items():
    current_model_name = model_class.__name__ # e.g., "CNNBaseline_Dropout3"
    
    print(f"\n\n=== Training {current_model_name} with Early Stopping (Augmented Data) ===")
    
    model = model_class().to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

    best_test_acc_for_current_model = 0.0
    patience_counter_for_current_model = 0
    
    # Paths for saving this specific model
    best_model_path = os.path.join(save_dir_base, f"{current_model_name}_best.pth")
    final_model_path = os.path.join(save_dir_base, f"{current_model_name}_final_training.pth")

    # train_acc_history = [] # Optional: for plotting learning curves later
    # test_acc_history = []  # Optional

    for epoch in range(max_epochs):
        model.train()
        running_loss = 0.0
        for specs, labels, _ in train_loader: # Gender is not used in the loss calculation
            specs, labels = specs.to(device), labels.to(device)
            
            optimizer.zero_grad()
            outputs = model(specs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()

        avg_epoch_loss = running_loss / len(train_loader)
        
        # Evaluate performance at the end of the epoch
        # We only need train_acc for printing here, full metrics for test
        train_acc, _, _, _ = evaluate(train_loader, model, device) 
        test_acc, _, _, test_f1 = evaluate(test_loader, model, device)
        
        # train_acc_history.append(train_acc)
        # test_acc_history.append(test_acc)
        
        print(
            f"Epoch {epoch+1:3d}/{max_epochs} | Model: {current_model_name} | "
            f"Loss: {avg_epoch_loss:.4f} | "
            f"Train Acc: {train_acc*100:5.2f}% | "
            f"Test Acc: {test_acc*100:5.2f}% | Test F1: {test_f1*100:5.2f}% | "
            f"Patience: {patience_counter_for_current_model}/{patience}"
        )
        
        # Check for improvement
        if test_acc > best_test_acc_for_current_model + min_improvement:
            best_test_acc_for_current_model = test_acc
            patience_counter_for_current_model = 0
            torch.save(model.state_dict(), best_model_path)
            print(f"    → New best test accuracy for {current_model_name}: {best_test_acc_for_current_model*100:.3f}% (saved to {best_model_path})")
        else:
            patience_counter_for_current_model += 1
            
        # Early stopping check
        if patience_counter_for_current_model >= patience:
            print(f"\nEarly stopping triggered for {current_model_name} after {epoch+1} epochs.")
            break
    
    # Save final model state (this is the state at the end of training, might not be the best)
    torch.save(model.state_dict(), final_model_path)
    print(f"Final training state for {current_model_name} saved to {final_model_path}")

    print(f"\nTraining completed for {current_model_name} after {epoch+1} epochs.")
    print(f"Best test accuracy achieved during training for {current_model_name}: {best_test_acc_for_current_model*100:.3f}%")

    # Load the best model for final evaluation (if it was saved)
    if os.path.exists(best_model_path):
        print(f"\nLoading best saved model for {current_model_name} from {best_model_path} for final evaluation...")
        model.load_state_dict(torch.load(best_model_path, map_location=device))
        eval_model_description = "best saved"
    else:
        print(f"\nNo best model was saved for {current_model_name} (best accuracy did not sufficiently improve). Using final model state for evaluation.")
        # model is already in its final state, or you could explicitly load final_model_path
        # model.load_state_dict(torch.load(final_model_path, map_location=device)) 
        eval_model_description = "final"

    print(f"\nFinal Classification Report for {current_model_name} (using {eval_model_description} model):")
    classification_report_for_model(model, test_loader, device)

    print(f"\nFinal Gender breakdown for {current_model_name} (using {eval_model_description} model):")
    gender_results = evaluate_by_gender(test_loader, model, device)
    for gender_key_code in gender_results: # gender_key_code will be 'm' or 'f'
        gender_label = "Male" if gender_key_code == "m" else "Female"
        metrics_dict = gender_results[gender_key_code]
        metrics_str = ", ".join([f"{k.capitalize()}: {v*100:.2f}%" if isinstance(v, float) else f"{k.capitalize()}: {v}" for k,v in metrics_dict.items()])
        print(f"  {gender_label}: {metrics_str}")
        
    # Final summary for the current model
    # Re-evaluate the loaded model (best or final) on the training set for accurate final train accuracy
    final_train_acc_loaded_model, _, _, _ = evaluate(train_loader, model, device)
    print(f"\n--- Summary for {current_model_name} ---")
    print(f"- Total epochs trained: {epoch+1}")
    print(f"- Best test accuracy during training: {best_test_acc_for_current_model*100:.3f}%")
    print(f"- Training accuracy of loaded ({eval_model_description}) model: {final_train_acc_loaded_model*100:.2f}%")
    if os.path.exists(best_model_path):
        print(f"- Best model saved to: {best_model_path}")
    else:
        print(f"- Best model not saved (or final model is the best achieved). Final model at: {final_model_path}")
    print(f"---------------------------------------\n")

print("\n\nAll model configurations have been trained and evaluated.")

Using device: mps
Number of training samples: 2532
Number of test samples: 634


=== Training CNNBaseline with Early Stopping (Augmented Data) ===
Epoch   1/150 | Model: CNNBaseline | Loss: 1.5104 | Train Acc: 24.29% | Test Acc: 22.08% | Test F1: 15.32% | Patience: 0/20
    → New best test accuracy for CNNBaseline: 22.082% (saved to /Users/larsheijnen/DL/saved_models/B/augmented_earlystop_all_models/CNNBaseline_best.pth)
Epoch   2/150 | Model: CNNBaseline | Loss: 1.2488 | Train Acc: 52.65% | Test Acc: 46.37% | Test F1: 44.07% | Patience: 0/20
    → New best test accuracy for CNNBaseline: 46.372% (saved to /Users/larsheijnen/DL/saved_models/B/augmented_earlystop_all_models/CNNBaseline_best.pth)
Epoch   3/150 | Model: CNNBaseline | Loss: 1.0593 | Train Acc: 64.61% | Test Acc: 57.57% | Test F1: 56.05% | Patience: 0/20
    → New best test accuracy for CNNBaseline: 57.571% (saved to /Users/larsheijnen/DL/saved_models/B/augmented_earlystop_all_models/CNNBaseline_best.pth)
Epoch   4/150 | Mod
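
The early-stopping rule used in the training loop above (reset the patience counter only when accuracy improves by more than `min_improvement`, stop once the counter reaches `patience`) can be isolated into a small helper. This is a minimal sketch, not the notebook's actual code: the `EarlyStopper` class and its method names are hypothetical, and the accuracies in the toy run are made up for illustration.

```python
class EarlyStopper:
    """Tracks the best score and counts epochs without a significant improvement."""

    def __init__(self, patience: int = 20, min_improvement: float = 0.005):
        self.patience = patience
        self.min_improvement = min_improvement
        self.best_score = 0.0
        self.counter = 0

    def update(self, score: float) -> bool:
        """Record a new score; return True if it counts as a new best."""
        if score > self.best_score + self.min_improvement:
            self.best_score = score
            self.counter = 0
            return True
        self.counter += 1
        return False

    @property
    def should_stop(self) -> bool:
        """True once `patience` consecutive epochs brought no significant gain."""
        return self.counter >= self.patience


# Toy run with made-up accuracies (not results from this notebook).
stopper = EarlyStopper(patience=2, min_improvement=0.005)
improved_flags = []
for acc in [0.50, 0.60, 0.602, 0.603, 0.70]:
    improved_flags.append(stopper.update(acc))
    if stopper.should_stop:
        break
```

In the toy run, the gains from 0.60 to 0.602 and 0.603 are below the 0.5% threshold, so they count against the patience budget and the loop stops before reaching the last value.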

### FINAL: Predicting accent on the test data with the best augmented models
This cell loads the separate test set of accent recordings from its own folder, applies the same preprocessing settings as in training, and wraps it in a `test_loader` that batches the data without shuffling. Keeping the order fixed makes the evaluation reproducible and lets each prediction be matched back to its file. Predictions are generated with the best saved models only.

In [8]:
testset_folder = "/Users/larsheijnen/DL/Test set"
test_dataset = AccentSpectrogramDataset(
    testset_folder,
    target_sr=16000,
    use_mel=True,
    n_fft=1024,
    hop_length=256,
    n_mels=64,
    log_scale=True
)
test_loader = DataLoader(test_dataset, batch_size=4, shuffle=False, collate_fn=pad_collate)

This cell locates the directory containing the saved model weights (`.pth` files), lists the files whose names contain "best", and maps each filename to its model class by checking the filename prefix. Because every class name starts with `CNNBaseline`, the more specific prefixes are checked first.

In [9]:
import os
import torch

# Build the saved-models path from the notebook's folder. abspath() resolves the name
# against the current working directory, so this assumes the notebook is run from its own folder.
base_dir = os.path.dirname(os.path.abspath('assignment__b_latest-2.ipynb'))  # or __file__ if in .py
saved_models_dir = os.path.join(base_dir, "saved_models", "B", "augmented_earlystop_all_models")

# List all .pth files in the directory that contain "best"
model_files = [f for f in os.listdir(saved_models_dir) if f.endswith(".pth") and "best" in f]

# Map model file names to their classes by filename prefix.
# Order matters: the most specific prefixes come first, since every name starts with "CNNBaseline".
model_classes = {}
for fname in model_files:
    if fname.startswith("CNNBaseline_Dropout3_BatchNorm"):
        model_classes[fname] = CNNBaseline_Dropout3_BatchNorm
    elif fname.startswith("CNNBaseline_Dropout5_BatchNorm"):
        model_classes[fname] = CNNBaseline_Dropout5_BatchNorm
    elif fname.startswith("CNNBaseline_Dropout3"):
        model_classes[fname] = CNNBaseline_Dropout3
    elif fname.startswith("CNNBaseline_Dropout5"):
        model_classes[fname] = CNNBaseline_Dropout5
    elif fname.startswith("CNNBaseline_BatchNorm"):
        model_classes[fname] = CNNBaseline_BatchNorm
    elif fname.startswith("CNNBaseline"):
        model_classes[fname] = CNNBaseline
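
The `if`/`elif` chain above depends on listing the most specific prefixes first. One way to make that ordering automatic is to try candidate names from longest to shortest. The sketch below assumes a hypothetical helper, `match_longest_prefix`, and uses placeholder strings as registry values; in the notebook the values would be the model classes themselves.

```python
def match_longest_prefix(filename, classes_by_name):
    """Return the value whose key is the longest prefix of `filename`, or None."""
    for name in sorted(classes_by_name, key=len, reverse=True):
        if filename.startswith(name):
            return classes_by_name[name]
    return None


# Placeholder values stand in for the real model classes (assumption for this sketch).
registry = {
    "CNNBaseline": "<CNNBaseline>",
    "CNNBaseline_Dropout3": "<CNNBaseline_Dropout3>",
    "CNNBaseline_Dropout3_BatchNorm": "<CNNBaseline_Dropout3_BatchNorm>",
}

matched = match_longest_prefix("CNNBaseline_Dropout3_BatchNorm_best.pth", registry)
```

With this approach a newly added model variant only needs a registry entry, and the order in which entries are written no longer affects which class is chosen.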



This function predicts the accent class for each sample in the test set: it passes batches through the model and collects the predicted labels together with the corresponding filenames. Filenames are recovered from each batch's position in the dataset, so the loader must not shuffle. It returns a list of `(filename, predicted_class)` tuples.

In [10]:
def predict_accent_on_testset(model, test_loader, device):
    model.eval()
    all_preds = []
    all_fnames = []
    with torch.no_grad():
        for i, (specs, _, _) in enumerate(test_loader):  # gender is ignored
            specs = specs.to(device)
            outputs = model(specs)
            preds = outputs.argmax(dim=1).cpu().tolist()
            all_preds.extend(preds)
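            # Recover filenames from the batch position (valid only because shuffle=False)
            batch_indices = range(i * test_loader.batch_size, i * test_loader.batch_size + len(preds))
            # Use the loader's own dataset rather than a global variable
            fnames = [os.path.basename(test_loader.dataset.file_paths[idx]) for idx in batch_indices]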
            all_fnames.extend(fnames)
    return list(zip(all_fnames, all_preds))
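
Since the loader does not shuffle, predictions arrive in dataset order, so the filename pairing can also be done once at the end instead of per batch. The following is a minimal sketch of that idea in plain Python; `pair_predictions_with_files` is a hypothetical helper, and the paths and predictions are invented for illustration.

```python
import os


def pair_predictions_with_files(file_paths, batched_preds):
    """Flatten per-batch predictions and pair them with filenames in dataset order."""
    flat_preds = [pred for batch in batched_preds for pred in batch]
    if len(flat_preds) != len(file_paths):
        raise ValueError("number of predictions does not match number of files")
    return [(os.path.basename(path), pred) for path, pred in zip(file_paths, flat_preds)]


# Hypothetical paths and predictions: batch size 2 with a shorter final batch.
paths = ["/data/test/1001.wav", "/data/test/1002.wav", "/data/test/1003.wav"]
pairs = pair_predictions_with_files(paths, [[0, 3], [1]])
```

The length check guards against a silent misalignment, for example if the loader were created with `drop_last=True`.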

In [13]:
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# Ensure the result directory exists
result_dir = os.path.join(base_dir, "result")
os.makedirs(result_dir, exist_ok=True)

for model_file, model_class in model_classes.items():
    model = model_class().to(device)
    model_path = os.path.join(saved_models_dir, model_file)
    model.load_state_dict(torch.load(model_path, map_location=device))
    model.eval()
    print(f"\nGenerating submission file for model: {model_file}")
    results = predict_accent_on_testset(model, test_loader, device)
    
    # Prepare submission lines
    submission_lines = ["Id,label"]
    for fname, pred in results:
        # Extract the numeric ID from the filename (e.g., 1023 from 1023.wav)
        file_id = os.path.splitext(fname)[0]
        submission_lines.append(f"{file_id},{pred+1}")  # Convert 0-based class index to 1-based label
    
    # Write to submission file in the result directory
    submission_filename = f"submission_{os.path.splitext(model_file)[0]}.csv"
    submission_path = os.path.join(result_dir, submission_filename)
    with open(submission_path, "w") as f:
        f.write("\n".join(submission_lines))
    print(f"Submission file saved as: {submission_path}")


Generating submission file for model: CNNBaseline_Dropout5_best.pth
Submission file saved as: /Users/larsheijnen/DL/result/submission_CNNBaseline_Dropout5_best.csv

Generating submission file for model: CNNBaseline_Dropout3_BatchNorm_best.pth
Submission file saved as: /Users/larsheijnen/DL/result/submission_CNNBaseline_Dropout3_BatchNorm_best.csv

Generating submission file for model: CNNBaseline_Dropout5_BatchNorm_best.pth
Submission file saved as: /Users/larsheijnen/DL/result/submission_CNNBaseline_Dropout5_BatchNorm_best.csv

Generating submission file for model: CNNBaseline_best.pth
Submission file saved as: /Users/larsheijnen/DL/result/submission_CNNBaseline_best.csv

Generating submission file for model: CNNBaseline_Dropout3_best.pth
Submission file saved as: /Users/larsheijnen/DL/result/submission_CNNBaseline_Dropout3_best.csv

Generating submission file for model: CNNBaseline_BatchNorm_best.pth
Submission file saved as: /Users/larsheijnen/DL/result/submission_CNNBaseline_Batch
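
The submission format written above (an `Id,label` header, the file stem as the ID, and labels shifted from 0-based to 1-based) could also be produced with Python's standard `csv` module. This is a sketch under those assumptions; `write_submission` is a hypothetical helper and the sample rows are invented, not predictions from these models.

```python
import csv
import io
import os


def write_submission(results, out):
    """Write (filename, 0-based prediction) pairs as 'Id,label' rows with 1-based labels."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Id", "label"])
    for fname, pred in results:
        file_id = os.path.splitext(fname)[0]  # e.g. "1023" from "1023.wav"
        writer.writerow([file_id, pred + 1])


# Write to an in-memory buffer here; a file opened with open(path, "w", newline="") works the same way.
buffer = io.StringIO()
write_submission([("1023.wav", 0), ("1024.wav", 4)], buffer)
csv_text = buffer.getvalue()
```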