# Lab 2 - X-vector systems
In this second lab, you will code, train, and optimize an x-vector system. By the end of this lab, you should be able to train your system on a given dataset and have Python files to reuse in the next labs of the semester, using very simple metrics for evaluation.

*This lab is due on Sunday the 15th at 11:59 PM.*

*If you have any questions, please ask them during the lab session, email all TAs and instructors, or come to any of the 4 office hours.*

You will need the ```dataset.py``` file from Lab 1 and the dataset used in Lab 1.

### Question 1 (10pts): Dataloaders
Using and modifying the previous lab's ```dataset.py``` file, create the train, val, and test loaders.

Use a batch size of 32 for train and val, and a batch size of 1 for test.

We want a 95/5 split between the train+val speakers and the test speakers, and a 90/10 split between the train and val utterances.

You can optionally use a sampler to balance for gender, or simply shuffle the train set instead.

You will need a collator that produces 24-dimensional filterbanks with a frame length of 25 ms.
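As a quick check on the collator output, the number of frames it produces can be worked out by hand. The sketch below assumes a 16 kHz sample rate, a 10 ms frame shift, a 5-second crop, and no padding at the edges; none of these are specified in this lab, so adjust them to your own setup. The helper name `num_fbank_frames` is hypothetical.

```python
def num_fbank_frames(num_samples, sample_rate=16000,
                     frame_length_ms=25.0, frame_shift_ms=10.0):
    """Number of full analysis windows that fit in the signal (no edge padding)."""
    frame_length = int(sample_rate * frame_length_ms / 1000)  # samples per window
    frame_shift = int(sample_rate * frame_shift_ms / 1000)    # samples per hop
    if num_samples < frame_length:
        return 0
    return 1 + (num_samples - frame_length) // frame_shift


# Assumed example: a 5-second crop sampled at 16 kHz.
print(num_fbank_frames(5 * 16000))  # 498
```

Under these assumptions the result would be consistent with the `[32, 498, 24]` input shape printed by the test cells in this lab.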

Add a print line that shows how many different speakers are in the train+val set, in the train, val, and test sets, and in total.
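One possible way to organize the two splits is sketched below on toy metadata. It is not the `load_all_data` implementation: the function names, the `(utterance_id, speaker_id)` tuple format, and the `seed` argument are all assumptions made for illustration.

```python
import random


def split_speakers_and_utterances(utterances, train_test_prop=0.95,
                                  train_val_prop=0.9, seed=0):
    """Split (utterance_id, speaker_id) pairs into train, val and test lists."""
    rng = random.Random(seed)

    # 1) Speaker-level split: test speakers never appear in train or val.
    speakers = sorted({spk for _, spk in utterances})
    rng.shuffle(speakers)
    n_train_val = round(train_test_prop * len(speakers))
    train_val_speakers = set(speakers[:n_train_val])
    test = [u for u in utterances if u[1] not in train_val_speakers]

    # 2) Utterance-level split inside the train+val speakers.
    train_val = [u for u in utterances if u[1] in train_val_speakers]
    rng.shuffle(train_val)
    n_train = round(train_val_prop * len(train_val))
    return train_val[:n_train], train_val[n_train:], test


def count_speakers(utterances):
    """Number of distinct speakers in a list of (utterance_id, speaker_id) pairs."""
    return len({spk for _, spk in utterances})


# Toy metadata (hypothetical): 20 speakers with 10 utterances each.
toy = [(f"utt_{s}_{i}", f"spk_{s}") for s in range(20) for i in range(10)]
train, val, test = split_speakers_and_utterances(toy)
print(f"Speakers | train+val: {count_speakers(train + val)} "
      f"| train: {count_speakers(train)} | val: {count_speakers(val)} "
      f"| test: {count_speakers(test)} | total: {count_speakers(toy)}")
```

Because the train/val split is made over utterances, the val set can contain fewer speakers than the train set, while the test speakers stay fully disjoint from both.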

In [1]:
from dataset import load_all_data
test_loader, val_loader, train_loader = load_all_data(
                                                metadata_file='VoxCeleb2_AE/metadata_dev.csv', 
                                                data_directory='VoxCeleb2_AE/dev', 
                                                batch_size=32,
                                                train_val_prop=0.9,
                                                train_test_prop=0.95,
                                                )

Speakers | train+val: 1039 | train: 1039 | val: 965 | test: 55 | total: 1094


### Question 2: The Classic X-vector Network

In this question, you will implement the architecture of a **classic x-vector system** following the structure described in: [X-Vectors: Robust DNN Embeddings for Speaker Recognition (Snyder et al., ICASSP 2018)](https://www.danielpovey.com/files/2018_icassp_xvectors.pdf)

Your implementation should follow the same logical organization as in the paper, consisting of:

- **Frame-level TDNN (Time-Delay Neural Network) layers**  
- **Statistics pooling** (mean + standard deviation across time)  
- **Segment-level fully-connected layers**  
- **Embedding and classification heads**

#### Question 2.A (25 pts): The TDNN Layer

You will first implement a `TDNNLayer` module.

- Input: a tensor of shape **[Batch, Time, Feature_dim]**  
- Context: a set of time offsets (e.g. `{t-2, t-1, t, t+1, t+2}`)  
- Output dimension: number of output units (e.g. 512)

For each frame at time `t`, your layer should:
1. Gather all context frames defined by the context offsets.  
2. Concatenate them into a single feature vector.  
3. Pass this vector through a `Linear` layer (and optionally a non-linearity).

Because frames at the beginning and end lack full context, the output will have fewer time steps.  
For a context `{t-2, …, t+2}`, the resulting output shape should be **[Batch, Time - 4, Output_dim]**.
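The gather-and-concatenate steps can be illustrated on plain Python lists before writing the tensor version. This is only a sketch of the splicing logic for a single utterance; the function name `splice_frames` is hypothetical, and the linear layer is left out.

```python
def splice_frames(frames, context):
    """Concatenate the context frames around every valid center time.

    frames:  list of T feature vectors (each a list of floats)
    context: list of integer time offsets, e.g. [-2, -1, 0, 1, 2]
    """
    start = max(0, -min(context))               # first center with full left context
    end = len(frames) - max(0, max(context))    # exclusive, last center with full right context
    spliced = []
    for t in range(start, end):
        vec = []
        for off in context:
            vec.extend(frames[t + off])         # gather, then concatenate on the feature axis
        spliced.append(vec)
    return spliced


# Six frames with one feature each, so the splicing is easy to read.
frames = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
out = splice_frames(frames, [-2, -1, 0, 1, 2])
print(len(out))  # 2, i.e. Time - 4
print(out[0])    # [0.0, 1.0, 2.0, 3.0, 4.0]
```

Each spliced vector has `Feature_dim * len(context)` values, which is the input size the `Linear` layer needs.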

Your `TDNNLayer` class should implement:
- `__init__(self, params)`: to define the context and linear layer  
- `forward(self, x)`: to perform frame splicing and linear transformation

Once implemented, verify that your `TDNNLayer` works correctly by:
- Passing a sample from your **val loader** through the layer  
- Checking that the output shape matches the expected dimensions

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class TDNNLayer(nn.Module):
    """Time Delay Neural Network layer."""
    def __init__(self, input_dim, input_context, output_dim):
        super().__init__()
        assert isinstance(input_context, (list, tuple)) and len(input_context) > 0

        self.context = list(input_context)
        self.min_off = min(self.context)
        self.max_off = max(self.context)

        self.linear = nn.Linear(input_dim * len(self.context), output_dim)
        self.act = nn.ReLU()
        


    def forward(self, x):
        if x.dim() != 3:
            raise ValueError(f"Expected [B,T,F], got {x.shape}")

        B, T, F = x.shape

        # valid center times so all context frames exist
        start_t = -self.min_off
        end_t = T - self.max_off   # exclusive
        if end_t <= start_t:
            raise ValueError(f"Sequence too short: T={T} for context [{self.min_off},{self.max_off}]")

        T_out = end_t - start_t

        # splice: collect frames at offsets, then concat on feature dim
        splices = []
        for off in self.context:
            splices.append(x[:, start_t + off : start_t + off + T_out, :])  # [B, T_out, F]

        x_spliced = torch.cat(splices, dim=2)  # [B, T_out, F*len(context)]
        y = self.linear(x_spliced)             # [B, T_out, output_dim]
        return self.act(y)


In [3]:
# Testing
# from models import TDNNLayer

model = TDNNLayer(24, [-2, -1, 0, 1, 2], 512)
for data_point in val_loader:
    melspec, spk_id, age, gender = data_point
    print(f"input shape: {melspec.shape}")
    output = model(melspec)
    print(f"output shape: {output.shape}")
    break

input shape: torch.Size([32, 498, 24])
output shape: torch.Size([32, 494, 512])


#### Question 2.B (25 pts): Full X-vector Architecture

Next, you will build the complete **X-vector network** by combining multiple TDNN layers, a pooling layer, and fully-connected layers.

#### Expected architecture
Follow the structure described in the X-vector paper:

1. Frame-level TDNN stack
2. Statistics pooling  
   - Compute the **mean** and **standard deviation** of the frame-level outputs across the time dimension.  
   - Concatenate them → shape becomes `[Batch, 3000]` for a 1500-dim input.
3. Segment-level (utterance-level) layers
   - Two fully-connected layers (e.g. 512 → 512), followed by ReLU activations.  
   - The output of the first segment-level layer (before activation) is the **embedding (x-vector)**.
4. Classification head
   - A final linear layer mapping the 512-dim embedding to the number of training speakers.  
   - Used only during training (softmax + cross-entropy loss).
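Before wiring the layers together, it can help to trace the time axis through the frame-level stack by hand. The sketch below assumes the five contexts from the x-vector paper and a 498-frame input (the length printed by the test cells in this lab); `trace_time_axis` is a hypothetical helper, not part of the required code.

```python
# Contexts of the five frame-level layers, assumed from the x-vector paper.
XVECTOR_CONTEXTS = [[-2, -1, 0, 1, 2], [-2, 0, 2], [-3, 0, 3], [0], [0]]


def trace_time_axis(num_frames, contexts):
    """Return the number of time steps left after each TDNN layer."""
    lengths = []
    t = num_frames
    for context in contexts:
        t -= max(context) - min(context)   # frames lost at the two edges
        if t <= 0:
            raise ValueError("input too short for this stack of contexts")
        lengths.append(t)
    return lengths


print(trace_time_axis(498, XVECTOR_CONTEXTS))  # [494, 490, 484, 484, 484]

# Total temporal context seen by one output frame of the last TDNN layer.
receptive_field = 1 + sum(max(c) - min(c) for c in XVECTOR_CONTEXTS)
print(receptive_field)  # 15

# Statistics pooling concatenates mean and std of the 1500-dim frame outputs.
print(2 * 1500)  # 3000
```

The pooled size does not depend on the number of frames, which is why the segment-level layers can use fixed input dimensions.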

Define a class `XVector(nn.Module)` that:
- Inherits from `torch.nn.Module`  
- Combines all components described above  
- Returns: **Logits** for training

Use a batch from your ```val loader``` to verify that:
- The model runs without errors  
- Output dimensions are as expected: `[Batch, Num_speakers]`  

At the end of the lab, you are expected to save your networks in a ```models.py``` file, so they can be imported in future labs.

In [4]:
class XVector(nn.Module):
    def __init__(self, size=1, depth=1, num_speakers=1000, embedding_dim=512, input_dim=24):
        super().__init__()
        # input_dim is taken from the constructor argument (24 filterbank bins by default)


        self.tdnn1 = TDNNLayer(input_dim, [-2, -1, 0, 1, 2], 512)
        self.tdnn2 = TDNNLayer(512, [-2, 0, 2], 512)
        self.tdnn3 = TDNNLayer(512, [-3, 0, 3], 512)
        self.tdnn4 = TDNNLayer(512, [0], 512)
        self.tdnn5 = TDNNLayer(512, [0], 1500)

        self.embedding_dim = embedding_dim

        self.seg_fc1 = nn.Linear(3000, embedding_dim)
        self.seg_fc2 = nn.Linear(embedding_dim, embedding_dim)
        self.relu = nn.ReLU()

        self.classifier = nn.Linear(embedding_dim, num_speakers)
        
    def forward(self, x):
        # Frame-level
        h = self.tdnn1(x)
        h = self.tdnn2(h)
        h = self.tdnn3(h)
        h = self.tdnn4(h)
        h = self.tdnn5(h)     # [B, T', 1500]

        mean = h.mean(dim=1)
        std = h.std(dim=1, unbiased=False)
        stats = torch.cat([mean, std], dim=1)  # [B, 3000]

        emb = self.seg_fc1(stats)
        h2 = self.relu(emb)
        h2 = self.relu(self.seg_fc2(h2))

        logits = self.classifier(h2)

        return logits

In [5]:
# Testing
# from models import XVector

model = XVector(input_dim=24, num_speakers=1000, embedding_dim=512)
for data_point in val_loader:
    melspec, spk_id, age, gender = data_point
    print(f"input shape: {melspec.shape}")
    output = model(melspec)
    print(f"output shape: {output.shape}")
    break

input shape: torch.Size([32, 498, 24])
output shape: torch.Size([32, 1000])


#### Question 2.C (10 pts): Embedding layer

We now have a classic x-vector system that can perform speaker identification on seen speakers.
However, it cannot yet perform speaker verification on unseen speakers, because that requires access to the embeddings.
Extend your `XVector` network from Question 2.B to optionally return the **embedding vector** used for speaker verification:

- Modify the `forward()` function to include an optional argument `return_embedding` (default =`False`).
- When `return_embedding=True`, the method should return the **512-dimensional embedding** produced by the first segment-level layer **before the classification head**.
- When `return_embedding=False`, it should behave as before, returning the classification logits.

To verify everything works, use a batch from your ```val loader``` with `return_embedding=True` and `return_embedding=False` to verify that:
- `model(x, return_embedding=False)` returns logits of shape `[Batch_size, Num_speakers]`
- `model(x, return_embedding=True)` returns embeddings of shape `[Batch_size, 512]`
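To make concrete why embeddings are needed for verification, here is a minimal scoring sketch. It uses cosine similarity as a simple stand-in (the x-vector paper scores with a PLDA backend instead), and both the function names and the `threshold=0.5` value are assumptions for illustration, not tuned settings.

```python
import math


def cosine_score(emb_a, emb_b):
    """Cosine similarity between two embeddings given as lists of floats."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot score a zero-norm embedding")
    return dot / (norm_a * norm_b)


def same_speaker(emb_a, emb_b, threshold=0.5):
    """Accept the trial when the score reaches the (hypothetical) threshold."""
    return cosine_score(emb_a, emb_b) >= threshold


# Toy 3-dim embeddings; real ones would come from model(x, return_embedding=True).
enrol = [0.9, 0.1, 0.0]
test_same = [0.8, 0.2, 0.1]
test_other = [0.0, 0.1, 0.9]
print(same_speaker(enrol, test_same), same_speaker(enrol, test_other))
```

Since the score only compares two embeddings, it works for speakers that were never part of the classifier's output layer.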

In [6]:
class XVector(nn.Module):
    def __init__(self, size=1, depth=1, num_speakers=1000, embedding_dim=512, input_dim=24):
        super().__init__()

        self.tdnn1 = TDNNLayer(input_dim, [-2, -1, 0, 1, 2], 512)
        self.tdnn2 = TDNNLayer(512, [-2, 0, 2], 512)
        self.tdnn3 = TDNNLayer(512, [-3, 0, 3], 512)
        self.tdnn4 = TDNNLayer(512, [0], 512)
        self.tdnn5 = TDNNLayer(512, [0], 1500)

        self.seg_fc1 = nn.Linear(3000, embedding_dim)  # embedding layer
        self.seg_fc2 = nn.Linear(embedding_dim, embedding_dim)
        self.relu = nn.ReLU()

        self.classifier = nn.Linear(embedding_dim, num_speakers)


    def forward(self, x, return_embedding=False):
        h = self.tdnn1(x)
        h = self.tdnn2(h)
        h = self.tdnn3(h)
        h = self.tdnn4(h)
        h = self.tdnn5(h)   # [B, T', 1500]

        mean = h.mean(dim=1)
        std = h.std(dim=1, unbiased=False)
        stats = torch.cat([mean, std], dim=1)   # [B, 3000]

        embedding = self.seg_fc1(stats)   

        if return_embedding:
            return embedding   # [B, 512]

        h2 = self.relu(embedding)
        h2 = self.relu(self.seg_fc2(h2))

        logits = self.classifier(h2)

        return logits

In [7]:
# Testing
# from models import XVector

model = XVector(input_dim=24, num_speakers=1000, embedding_dim=512)
for data_point in val_loader:
    melspec, spk_id, age, gender = data_point
    print(f"input shape: {melspec.shape}")
    logits = model(melspec, return_embedding=False)
    print(f"logits shape: {logits.shape}")
    embedding = model(melspec, return_embedding=True)
    print(f"embedding shape: {embedding.shape}")
    break

input shape: torch.Size([32, 498, 24])
logits shape: torch.Size([32, 1000])
embedding shape: torch.Size([32, 512])


### Question 3 (15 pts): Training and Validation

In this question, you will train the X-vector network that you implemented in Question 2 on the training set. You will also evaluate its performance on the validation set using **classification accuracy** as the validation metric.

1. **Training setup**
   - Use your `train_loader` to iterate over the training data.
   - Use **cross-entropy loss** (`nn.CrossEntropyLoss`) as the objective function.
   - Use **Adam** as the optimizer.
   - Train the model for 10 epochs.

2. **Validation**
   - After each epoch, evaluate the model on the `val_loader`.
   - Compute the **validation accuracy**
   - Keep track of the **training loss**, **validation loss** and **validation accuracy** for each epoch.

3. **Reporting**
   - Print the training loss and validation accuracy across epochs.
   - Identify the epoch that gives the **best validation accuracy**.

After training:
- Print the best validation accuracy achieved.
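When reading the loss and accuracy curves, it helps to know what a model that guesses uniformly at random would score. The sketch below computes that baseline; `chance_baselines` is a hypothetical helper, and the value 100 matches the reduced speaker count suggested in this lab rather than the full dataset.

```python
import math


def chance_baselines(num_classes):
    """Cross-entropy and accuracy (%) of a uniform random guess over num_classes."""
    loss = math.log(num_classes)        # -log(1 / num_classes)
    accuracy = 100.0 / num_classes
    return loss, accuracy


loss, acc = chance_baselines(100)
print(f"chance loss: {loss:.4f} | chance accuracy: {acc:.2f}%")
# chance loss: 4.6052 | chance accuracy: 1.00%
```

A validation loss well above this value, or an accuracy below it, is usually worth investigating before training longer.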


_You can re-use your code from the last lab._

_Also, if you want to make the training faster, you can reduce the number of speakers in your dataloaders (maybe 100?) and the internal dimensions of the network (512 -> 64)._

In [8]:
import torch.optim as optim
import torch.nn as nn
import torch
from tqdm import tqdm 

from models import XVector
from dataset import load_all_data

number_train_speakers = 100
#Dataset loading
test_loader, val_loader, train_loader = load_all_data(
                                                metadata_file='VoxCeleb2_AE/metadata_dev.csv', 
                                                data_directory='VoxCeleb2_AE/dev', 
                                                batch_size=128,
                                                train_val_prop=0.9,
                                                train_test_prop=0.95,
                                                speaker_subset=[number_train_speakers,20]
                                                )
#New Model
model = XVector(input_dim=24, num_speakers=number_train_speakers, embedding_dim=64, internal_dim=64)

# Loss
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=5e-3, weight_decay=1e-5)

# Tracking
train_losses = []
val_losses = []
val_accuracies = []

best_val_acc = -1.0
best_epoch = -1



Speakers | train+val: 100 | train: 100 | val: 94 | test: 20 | total: 1094


In [None]:
device = 'mps'  # Apple silicon; use 'cuda' for an NVIDIA GPU or Colab, or 'cpu' if you are patient
num_epochs=10
model.to(device)
for epoch in range(num_epochs):
    # --------------------
    # TRAINING PHASE
    # --------------------
    model.train()
    train_loss, train_correct, train_total = 0.0, 0, 0

    for melspec, spk_id, age, gender in train_loader:
        melspec = melspec.to(device)
        spk_id = spk_id.to(device)

        optimizer.zero_grad()
        logits = model(melspec)
        loss = criterion(logits, spk_id)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        preds = torch.argmax(logits, dim=1)
        train_correct += (preds == spk_id).sum().item()
        train_total += spk_id.size(0)

    train_loss /= len(train_loader)
    train_acc = 100 * train_correct / train_total

        

    # --------------------
    # VALIDATION PHASE
    # --------------------
    model.eval()
    val_loss, val_correct, val_total = 0.0, 0, 0
   
    with torch.no_grad():
        for melspec, spk_id, age, gender in tqdm(val_loader, desc=f"Val {epoch+1}/{num_epochs}", leave=False):
            melspec = melspec.to(device)
            spk_id = spk_id.to(device)

            logits = model(melspec)
            loss = criterion(logits, spk_id)

            val_loss += loss.item()
            preds = torch.argmax(logits, dim=1)
            val_correct += (preds == spk_id).sum().item()
            val_total += spk_id.size(0)

    val_loss = val_loss / len(val_loader)
    val_acc = 100.0 * val_correct / val_total
        
    # --------------------
    # LOGGING
    # --------------------
    train_losses.append(train_loss)
    val_losses.append(val_loss)
    val_accuracies.append(val_acc)

    if val_acc > best_val_acc:
        best_val_acc = val_acc
        best_epoch = epoch + 1  
        
    print(f"Epoch [{epoch+1:02d}/{num_epochs}] "
            f"| Train Loss: {train_loss:.4f}, Acc: {train_acc:.2f}% "
            f"| Val Loss: {val_loss:.4f}, Acc: {val_acc:.2f}%")

# Reporting
print("\nTraining loss across epochs:")
for i, tl in enumerate(train_losses, start=1):
    print(f"  Epoch {i:02d}: {tl:.4f}")

print("\nValidation accuracy across epochs:")
for i, va in enumerate(val_accuracies, start=1):
    print(f"  Epoch {i:02d}: {va:.2f}%")

print(f"\nBest validation accuracy: {best_val_acc:.2f}% (Epoch {best_epoch})")


                                                                                

Epoch [01/10] | Train Loss: 6.2454, Acc: 4.95% | Val Loss: 5.2679, Acc: 0.31%
Epoch [02/10] | Train Loss: 4.0431, Acc: 8.22% | Val Loss: 5.5982, Acc: 0.31%
Epoch [03/10] | Train Loss: 3.7963, Acc: 11.35% | Val Loss: 6.7507, Acc: 0.62%
Epoch [04/10] | Train Loss: 3.5741, Acc: 14.10% | Val Loss: 7.8788, Acc: 4.06%
Epoch [05/10] | Train Loss: 3.4655, Acc: 16.00% | Val Loss: 7.3014, Acc: 3.12%
Epoch [06/10] | Train Loss: 3.2456, Acc: 19.12% | Val Loss: 10.2054, Acc: 4.69%
Epoch [07/10] | Train Loss: 3.0560, Acc: 22.21% | Val Loss: 10.4289, Acc: 3.75%


## What did it learn?
We saw (hopefully) that the validation accuracy rose, which suggests your model learned to separate speakers.
However, we do not yet have a metric that characterizes how well this model works on unseen speakers.
For this, we are going to use a set of visualization techniques.

_Remark: You are not graded on the performance of your model, only on the execution of the techniques. If it doesn't show what you expected, don't re-train it 20 times._

### Question 4.A (10pts): TSNE for speaker separation
1. Extract all the speaker embeddings from the `test_loader`.
2. Use `sklearn.manifold.TSNE` to learn a 2D t-SNE projection of the speaker embeddings.
3. Use `matplotlib.pyplot.scatter` (or seaborn, if you like it pretty) to plot the test-set embeddings in a 2D figure, with a different color per speaker.
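One practical detail for the "different color per speaker" step: raw speaker IDs can be sparse, so passing them directly as colors on a continuous colormap may give several speakers nearly identical shades. A small sketch of remapping them to consecutive indices follows; the helper name `to_color_indices` and the example IDs are hypothetical.

```python
def to_color_indices(speaker_ids):
    """Map arbitrary speaker IDs to consecutive integers 0..K-1.

    Returns the per-point indices and the sorted list of unique IDs,
    so that index i corresponds to unique_ids[i].
    """
    unique_ids = sorted(set(speaker_ids))
    lookup = {spk: i for i, spk in enumerate(unique_ids)}
    return [lookup[spk] for spk in speaker_ids], unique_ids


# Hypothetical sparse IDs for four test utterances.
indices, unique_ids = to_color_indices([907, 12, 907, 455])
print(indices)     # [2, 0, 2, 1]
print(unique_ids)  # [12, 455, 907]
```

The resulting indices can be passed as the `c` argument of `scatter`, for example together with a qualitative colormap such as `tab20` when the number of test speakers is small.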

What can you observe?
Are the embeddings separated?
The t-SNE plot shows that some speakers form clear clusters. That means the model learned to group recordings from the same speaker close together in the embedding space. For example, the clusters on the far right and top-left are quite compact, which suggests the model can clearly recognize those speakers.

However, in the middle of the plot, many points overlap and mix together. This means the model is not perfectly separating all speakers. Some voices are probably more similar to each other, or the model hasn't trained long enough to fully distinguish them.

In [None]:
# Get test embeddings

import numpy as np
import torch
from tqdm import tqdm

device='mps'
model = model.to(device)
model.eval()

all_embeddings = []
all_labels = []
all_genders = []

with torch.no_grad():
    for melspec, spk_id, age, gender in tqdm(test_loader, desc="Extract embeddings"):
        melspec = melspec.to(device)

        # Use the embedding defined in Question 2.C: the output of the first
        # segment-level layer, before the activation and the classifier.
        emb = model(melspec, return_embedding=True)   # [B, embedding_dim]

        all_embeddings.append(emb.cpu().numpy())
        all_labels.append(spk_id.cpu().numpy())
        all_genders.append(gender.cpu().numpy() if torch.is_tensor(gender) else np.array(gender))

all_embeddings = np.concatenate(all_embeddings, axis=0)  # [N, embedding_dim]
all_labels = np.concatenate(all_labels, axis=0)          # [N]
all_genders = np.concatenate(all_genders, axis=0) if len(all_genders) else None

print("Embeddings shape:", all_embeddings.shape)
print("Labels shape:", all_labels.shape)


In [None]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

tsne = TSNE(n_components=2, perplexity=30, learning_rate="auto", init="pca", random_state=42)
embeddings_2d = tsne.fit_transform(all_embeddings)  

plt.figure(figsize=(10, 8))
scatter = plt.scatter(
    embeddings_2d[:, 0],
    embeddings_2d[:, 1],
    c=all_labels,      
    s=12,
    alpha=0.8
)
plt.title("t-SNE of X-Vector Speaker Embeddings (Test Set)")
plt.xlabel("t-SNE dim 1")
plt.ylabel("t-SNE dim 2")
plt.colorbar(scatter, label="Speaker ID")
plt.show()

print("t-SNE output shape:", embeddings_2d.shape)

### Question 4.B (5pts): Demographics details
Modify your scatterplot to use a different marker per gender. Do you see a natural separation of the data?

In [None]:
plt.figure(figsize=(10, 8))
for gender in np.unique(all_genders):
    idx = all_genders == gender
    
    marker_style = "o" if gender == 0 else "x" 
    
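    # Share one color scale across both calls so a speaker ID maps to the
    # same color whatever the gender subset, and the colorbar stays valid.
    scatter = plt.scatter(
        embeddings_2d[idx, 0],
        embeddings_2d[idx, 1],
        c=all_labels[idx],
        cmap="viridis",
        vmin=all_labels.min(),
        vmax=all_labels.max(),
        marker=marker_style,
        s=20,
        alpha=0.8,
        label=f"Gender {gender}"
    )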

plt.title("t-SNE of X-Vector Embeddings (Color=Speaker, Marker=Gender)")
plt.xlabel("t-SNE dim 1")
plt.ylabel("t-SNE dim 2")
plt.legend()
plt.colorbar(scatter, label="Speaker ID")
plt.show()


## Helper function: Saving and Loading your model
If you want to save/load a version of your model, in case you need to restart the notebook, you can use the following commands:

In [None]:
# Saving
import os
os.makedirs('models', exist_ok=True)
model_save_path = "models/xvector_epoch8.pt"
torch.save(model.state_dict(), model_save_path)

In [None]:
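# Loading
import torch
from models import XVector
model_save_path = "models/xvector_epoch8.pt"
# After a restart, re-create the model with the same arguments used for training, e.g.:
# model = XVector(input_dim=24, num_speakers=100, embedding_dim=64, )
# map_location="cpu" lets the checkpoint load on a machine without the original device
model.load_state_dict(torch.load(model_save_path, map_location="cpu"))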