# Lab 5: Text-to-Speech (TTS)

# To Begin

This is the practical part of your ASR-TTS course. There are five labs in total: three focus on Automatic Speech Recognition and two on Text-to-Speech models. Each lab lasts two hours and consists of two parts:
* Reading Part
* Coding Part 

In each part you may find questions or tasks/activities to complete.

LAB 5/5




# What will you learn in LAB 5?
* Review and reinforce the core building blocks of ASR and TTS systems
* Understand how speaker identity is encoded in speech signals
* Distinguish between what is said (linguistic content) and who says it (speaker identity)
* Understand why speaker identity can be considered sensitive personal data
* Learn the motivation behind speaker anonymization and voice privacy protection
* Analyze how modern speech systems can unintentionally leak speaker identity


# Revision

### <span style="color:red"> **Questions: ASR Revision**</span>


1. Explain how the architectures of the following systems differ, and give two advantages and two limitations of each:
* traditional modular ASR systems
* end-to-end ASR systems

In [None]:
# Answer

2. Explain why feature extraction is needed in ASR and outline the main steps involved, from waveform to feature vectors.

In [None]:
# Answer

3. Explain how voice conversion works.
Help: https://arxiv.org/abs/2008.03648


In [None]:
# Answer

### <span style="color:red"> **Questions: TTS Revision**</span>


1. Explain the architecture of a TTS model and what each part is responsible for

In [None]:
# Answer

2. What is a triphone? Name one advantage and one disadvantage of using triphones.

https://www.researchgate.net/publication/4194601_Biphone-rich_versus_triphone-rich_a_comparison_of_speech_corpora_in_automatic_speech_recognition

3. How can we evaluate a TTS model?


In [None]:
# Your Answer

# Speaker Identification

Speaker identification is the task of recognizing who is speaking from a voice recording. While ASR focuses on understanding what is being said and TTS focuses on producing speech, speaker identification focuses on the person behind the voice. This is important because modern speech systems can automatically learn and recognize speaker identity, even when this is not their main goal. When learning about ASR and TTS, it is therefore important to understand that speech technology does not only process words, but also personal information such as voice characteristics. This is closely related to speaker privacy and anonymization, which aim to protect speakers from being identified. The SpeechBrain toolkit includes models not only for ASR and TTS, but also for speaker identification and speaker embedding extraction, allowing us to study all these aspects within the same framework.

In this lab, we will use a pre-trained speaker embedding model from SpeechBrain to extract speaker representations from audio and identify who is speaking. 
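
The idea of identifying a speaker from an embedding can be sketched as follows. This is a minimal illustration, not the SpeechBrain implementation: the helper names `cosine_similarity` and `identify` and the toy 3-dimensional vectors are assumptions made for this example (real speaker embeddings have many more dimensions). Each enrolled speaker is stored as one vector, and a new recording is assigned to the speaker whose vector points in the most similar direction.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def identify(query: np.ndarray, enrolled: dict) -> tuple:
    """Return (name, score) of the enrolled speaker most similar to the query."""
    scores = {name: cosine_similarity(query, emb) for name, emb in enrolled.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical toy embeddings, one per enrolled speaker
enrolled = {
    "speaker_a": np.array([1.0, 0.0, 0.0]),
    "speaker_b": np.array([0.0, 1.0, 0.0]),
}

# Embedding of an unknown recording (also hypothetical)
query = np.array([0.9, 0.1, 0.0])

name, score = identify(query, enrolled)
print(name, round(score, 3))
```

The code later in this lab follows the same principle, but uses cosine distance (1 minus cosine similarity), so smaller values mean a closer match.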

### <span style="color:red"> **Questions: Speaker Identification**</span>




Think about and explain what kinds of systems could use speaker identification models.

In [None]:
# Your Answer

Read the following research paper (pages 442-444) and explain what information about a person can be extracted from their voice and why this is dangerous.

https://encrypto.de/papers/NJTKHMAATMGPCESBRTB19.pdf

In [None]:
# Your Answer

In [5]:
import torch, torchaudio
print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)



torch: 2.4.1+cu121
torchaudio: 2.4.1+cu121


### We will need to install the pyannote library

In [None]:
!pip install "pyannote.audio==3.4.0" 

In [6]:
!pip install git+https://github.com/speechbrain/speechbrain.git@develop


In [None]:
import torch
import torchaudio
from speechbrain.inference.speaker import EncoderClassifier
from pyannote.audio import Pipeline
from scipy.spatial.distance import cdist
from huggingface_hub import hf_hub_download
import numpy as np
import soundfile as sf


### Audio generation

To create a conversation in which different speakers can be identified, we will first join three separate audio recordings into a single audio file, adding a 1-second pause between recordings. This simulates a simple conversation where different people speak one after another. We will then use a pre-trained speaker model to analyze the combined audio and automatically determine who is speaking and when. Through this process, you will see how speaker identity can be detected from speech.

In [None]:
# Input files of three different speakers from your lab dataset
wav_files = [
    # add the paths to your three recordings here, one string per file
]

# Pause duration (seconds)
pause_duration = 1.0  # 1 second pause

# Read first file to get sample rate
audio_list = []
data, sr = sf.read(wav_files[0])
audio_list.append(data)

# Create silence
pause = np.zeros(int(pause_duration * sr))

# Process remaining files
for wav in wav_files[1:]:
    audio_list.append(pause)
    data, sr_i = sf.read(wav)
    assert sr_i == sr, "Sample rates must match"
    audio_list.append(data)

# Concatenate everything
final_audio = np.concatenate(audio_list)

# Save output
sf.write("combined.wav", final_audio, sr)

In [7]:
# Use the GPU if one is available, otherwise fall back to the CPU.
# Run this cell even without a GPU: `device` is used in the cells below.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Loading the pre-trained model

In [None]:
# Load pre-trained model for speaker embedding extraction and move it to the device
# Note: this SpeechBrain model should download without a token; the Hugging Face access token is needed for the pyannote pipeline below.
classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb", run_opts={"device": device})
classifier = classifier.to(device)

### Pipeline

For this part of the lab, you will need to register on Hugging Face, because some pre-trained speech models require an access token. This is mainly due to licensing, usage conditions, and privacy considerations, especially for models that process or reveal speaker identity. After creating a free Hugging Face account, you must accept the usage conditions of the speaker diarization model and generate a read-only access token, which allows your code to download and use the model. This token does not cost anything and is only used to verify that you have agreed to the model’s conditions.

Here is what you should do:

* Register and log in: https://huggingface.co
* Accept the conditions of the following three models:
    * pyannote/speaker-diarization-3.1
    * pyannote/segmentation-3.0
    * pyannote/wespeaker-voxceleb-resnet34-LM
* Go to settings -> access tokens -> read token
* Mark all fields, create the token, and copy/paste it below

### <span style="color:red"> **Important security warning:**</span>
Never push access tokens to GitHub or any public repository.
Before submitting or sharing your lab work, you must delete your token from the notebook or script. Tokens should be treated like passwords. If you accidentally share a token, revoke it immediately in your Hugging Face account and create a new one. In addition, do not upload your token to public AI chatbots.
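
One way to keep the token out of the notebook is to read it from an environment variable instead of pasting it into a cell. The sketch below is only a suggestion: the helper `get_hf_token` is hypothetical, and the variable name `HF_TOKEN` is an assumption that you can change to whatever you set in your own shell.

```python
import os


def get_hf_token(var_name: str = "HF_TOKEN") -> str:
    """Read the Hugging Face access token from an environment variable.

    Raises a RuntimeError with a readable message if the variable is not set,
    so the token never has to be written into the notebook itself.
    """
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Environment variable {var_name} is not set. "
            "Set it in your shell before starting Jupyter."
        )
    return token


# Example use (assumes you have set HF_TOKEN before starting the notebook):
# diarization = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1",
#                                        use_auth_token=get_hf_token())
```

With this approach the notebook you submit contains no secret, and nothing needs to be deleted before sharing it.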

In [None]:
# Pre-trained model for speaker diarization
# Note: this pipeline is gated, so it requires your Hugging Face access token (see the steps above).
diarization = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1",
                                        use_auth_token="YOUR_HUGGING_FACE_API_KEY")

This speaker diarization system works by listening to the audio and automatically detecting when speech is present and when it changes from one speaker to another. For each short speech segment, the model extracts characteristics that are typical of a person’s voice. These characteristics are summarized into a compact numerical representation. Segments with similar voice characteristics are then grouped together, so that parts spoken by the same person are assigned the same speaker label. 
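
The grouping step described above can be illustrated with a toy example. This is a simplified sketch and not the clustering method used inside the pyannote pipeline: the function `cluster_segments`, the similarity threshold of 0.7, and the 2-dimensional "embeddings" are all assumptions chosen for illustration. Each segment is compared with the clusters found so far and either joins the most similar one or starts a new speaker label.

```python
import numpy as np


def cluster_segments(embeddings: np.ndarray, threshold: float = 0.7) -> list:
    """Greedy clustering of segment embeddings by cosine similarity.

    A segment joins the most similar existing cluster if the similarity to
    that cluster's centroid is at least `threshold`; otherwise it opens a
    new cluster. Returns one integer speaker label per segment.
    """
    # L2-normalise rows so that a dot product equals cosine similarity
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroids = []  # running sum of the member vectors of each cluster
    labels = []
    for vec in normed:
        if centroids:
            sims = [float(vec @ (c / np.linalg.norm(c))) for c in centroids]
            best = int(np.argmax(sims))
            if sims[best] >= threshold:
                centroids[best] = centroids[best] + vec
                labels.append(best)
                continue
        centroids.append(vec.copy())
        labels.append(len(centroids) - 1)
    return labels


# Hypothetical embeddings for four consecutive speech segments
segment_embeddings = np.array([
    [1.0, 0.1],
    [0.1, 1.0],
    [0.9, 0.2],
    [0.2, 0.9],
])

labels = cluster_segments(segment_embeddings)
print(labels)
```

Segments that receive the same integer are treated as the same speaker. Note that these labels are anonymous: diarization answers "who spoke when" without knowing the speakers' names.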

In [None]:
known_speaker_files = [
    ("path1", "Steve Jobs"),
    ("path2", "Elon Musk"),
    ("path3", "Nelson Mandela"),
]

known_speakers = []
known_speaker_ids = []

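for wav_path, name in known_speaker_files:
    wav, sr = torchaudio.load(wav_path)  # wav has shape (channels, samples)
    # Downmix multi-channel recordings to mono
    if wav.shape[0] > 1:
        wav = wav.mean(dim=0, keepdim=True)
    # The ECAPA model was trained on 16 kHz audio, so resample if needed
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    wav = wav.to(device)
    with torch.no_grad():
        emb = classifier.encode_batch(wav)  # shape (1, 1, 192)
    emb_vec = emb.squeeze().detach().cpu().numpy().reshape(-1)  # flat vector of length 192
    known_speakers.append(emb_vec)
    known_speaker_ids.append(name)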

known_matrix = np.vstack(known_speakers)  

print("Known matrix shape:", known_matrix.shape)

Known matrix shape: (3, 192)


In [56]:
audio_path = "combined.wav"  # file written by the audio generation cell; adjust if yours is stored elsewhere


In [57]:
segments = diarization(audio_path, min_speakers=1, max_speakers=3)


In [None]:
info = torchaudio.info(audio_path)
sr = info.sample_rate

threshold = 0.8  # maximum cosine distance accepted as a match (0 = same direction, 2 = opposite); tune on your data

for segment, track, label in segments.itertracks(yield_label=True):
    start_time, end_time = segment.start, segment.end

    frame_offset = int(start_time * sr)
    num_frames = int((end_time - start_time) * sr)
    if num_frames <= 0:
        continue

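    wav_seg, _ = torchaudio.load(audio_path, frame_offset=frame_offset, num_frames=num_frames)

    if wav_seg.shape[0] > 1:
        wav_seg = wav_seg.mean(dim=0, keepdim=True)
    # Match the 16 kHz rate the ECAPA model was trained on
    if sr != 16000:
        wav_seg = torchaudio.functional.resample(wav_seg, sr, 16000)

    wav_seg = wav_seg.to(device)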

    with torch.no_grad():
        emb = classifier.encode_batch(wav_seg)

    emb_query = emb.squeeze().detach().cpu().numpy().reshape(1, -1)  
    # cosine distance to all known speakers
    distances = cdist(emb_query, known_matrix, metric="cosine").flatten()  
    best_i = int(np.argmin(distances))
    best_dist = float(distances[best_i])
    best_name = known_speaker_ids[best_i]

    if best_dist < threshold:
        print(f"{best_name} ({label}) : {start_time:.2f}s–{end_time:.2f}s  dist={best_dist:.3f}")
    else:
        print(f"Unknown ({label}) : {start_time:.2f}s–{end_time:.2f}s  best={best_name} dist={best_dist:.3f}")

### <span style="color:red"> **Question: exercise 1**</span>


Get into groups of three. Record a conversation together, plus one separate recording of each of your voices, and run the speaker identification model. Discuss how accurate the model is. Did it identify each of you correctly?
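
To put a number on "how accurate", one option is to weight each segment by its duration and compute the fraction of speech time that was attributed to the right person. The sketch below is one possible way to do this, not a standard metric from the toolkits used in this lab: the function `identification_accuracy` and the example names and times are hypothetical, and you would fill in the true speaker of each segment by listening to your recording.

```python
def identification_accuracy(results):
    """Fraction of speech time assigned to the correct speaker.

    `results` is a list of (start, end, predicted_name, true_name) tuples,
    with start and end given in seconds.
    """
    total = sum(end - start for start, end, _, _ in results)
    if total <= 0:
        raise ValueError("No speech duration to evaluate.")
    correct = sum(end - start for start, end, pred, true in results if pred == true)
    return correct / total


# Hypothetical results: model output plus the true speaker noted by hand
results = [
    (0.0, 4.0, "Alice", "Alice"),
    (5.0, 8.0, "Bob", "Bob"),
    (9.0, 12.0, "Alice", "Carol"),
]

accuracy = identification_accuracy(results)
print(f"Correctly identified speech time: {accuracy:.0%}")
```

Weighting by duration means that a long, correctly identified turn counts for more than a short, misidentified one; counting segments instead would be an equally reasonable choice to discuss.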

# Speaker Anonymization

### <span style="color:red"> **Question: Anonymization Techniques**</span>

Read the following research paper and describe two possible anonymization techniques:
https://inria.hal.science/hal-04667625/file/panariello_TASLP24.pdf


In [None]:
# Your Answer

# Wrap-up

The goal of this final lab session was to highlight that working with speech and voice technologies is not only a technical task, but also one that carries significant ethical and societal risks. Modern TTS and voice processing systems make it possible to manipulate speaker identity, generate realistic synthetic voices, and alter speech in ways that can be misused if not handled responsibly. By examining speaker identification and anonymization, this lab aimed to show that voice is a biometric signal and that its manipulation should not be underestimated. Understanding these risks is an important part of training for anyone who works with speech technologies.

