<a href="https://colab.research.google.com/github/KarlHajal/EE-554-TTS/blob/main/EE_554_XTTS_Synthesis_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# EE-554 XTTS Synthesis Exercise

## Introduction
The goal of this exercise is to become familiar with modern text-to-speech (TTS) technology and to explore the strengths and limitations of a popular open-source TTS model. We will use the Coqui.ai TTS library, which provides pre-trained neural models that can be downloaded and run in a few lines of code. We will focus on Coqui's XTTS model, a multi-lingual TTS model trained on 16 languages. It also supports zero-shot voice cloning: given a short audio sample, the model can imitate a speaker's voice even if that voice never appeared in its training data.

To keep model inference fast, **use a GPU runtime**: in the toolbar above, go to Runtime > Change runtime type, select T4 GPU, and click Save.

### Step 1: Install Requirements

In [None]:
!pip install coqui-tts
!pip install ipywebrtc
!sudo apt update && sudo apt install -y ffmpeg

from ipywebrtc import AudioRecorder, CameraStream
from IPython.display import Audio, display, Markdown
import ipywidgets as widgets
from google.colab import output
output.enable_custom_widget_manager()

import torch
from TTS.api import TTS

# Use the GPU if one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

### Step 2: Download audio files

In [None]:
!git clone https://github.com/KarlHajal/EE-554-TTS/
!mv /content/EE-554-TTS/voice_examples /content/

### Step 3: Synthesis and Voice Cloning

We will first test the XTTS model in different languages, using an ordinary, natural-sounding voice as the reference. Next, we will clone a variety of voices to see how well the model imitates them. Finally, we will record our own voice samples and test how the reference recording influences the model's outputs.

In [None]:
# Load the XTTS model
# When prompted, write 'y' in the text box and press enter to agree to the terms and conditions
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

#### **I - Multi-lingual Synthesis**
First, we will synthesize sentences in different languages using a US English female voice as the reference.

In [None]:
# Run XTTS
# XTTS is a multi-lingual text-to-speech model with voice cloning.
# The tts_to_file method below takes the text to synthesize, the language, the reference voice clip, and the output file path.
tts.tts_to_file(text="I forgot my umbrella, and of course, it started raining.", speaker_wav="voice_examples/libri121.wav", language="en", file_path="english_output.wav")
display(Markdown("English Output:"))
display(Audio("/content/english_output.wav", autoplay=False))

tts.tts_to_file(text="J'ai oublié mon parapluie et bien sûr, il a commencé à pleuvoir.", language="fr", speaker_wav="voice_examples/libri121.wav", file_path="french_output.wav")
display(Markdown("French Output:"))
display(Audio("/content/french_output.wav", autoplay=False))

tts.tts_to_file(text="Ich habe meinen Regenschirm vergessen und natürlich hat es angefangen zu regnen.", language="de", speaker_wav="voice_examples/libri121.wav", file_path="german_output.wav")
display(Markdown("German Output:"))
display(Audio("/content/german_output.wav", autoplay=False))

tts.tts_to_file(text="Ho dimenticato l'ombrello e, naturalmente, ha cominciato a piovere.", language="it", speaker_wav="voice_examples/libri121.wav", file_path="italian_output.wav")
display(Markdown("Italian Output:"))
display(Audio("/content/italian_output.wav", autoplay=False))

tts.tts_to_file(text="Olvidé mi paraguas y, por supuesto, empezó a llover.", language="es", speaker_wav="voice_examples/libri121.wav", file_path="spanish_output.wav")
display(Markdown("Spanish Output:"))
display(Audio("/content/spanish_output.wav", autoplay=False))
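
The five calls above differ only in the text and the language code, so they could also be driven from a small table. The following is a minimal sketch rather than part of the original exercise: `SENTENCES`, `build_jobs`, and `synthesize_all` are hypothetical helper names, the output file names (`en_output.wav`, `fr_output.wav`, ...) are an assumption and differ from the names used above, and `tts` is assumed to be the model loaded earlier.

```python
from pathlib import Path

# Same reference clip and sentences as in the cell above.
REFERENCE_WAV = "voice_examples/libri121.wav"
SENTENCES = {
    "en": "I forgot my umbrella, and of course, it started raining.",
    "fr": "J'ai oublié mon parapluie et bien sûr, il a commencé à pleuvoir.",
    "de": "Ich habe meinen Regenschirm vergessen und natürlich hat es angefangen zu regnen.",
    "it": "Ho dimenticato l'ombrello e, naturalmente, ha cominciato a piovere.",
    "es": "Olvidé mi paraguas y, por supuesto, empezó a llover.",
}


def build_jobs(sentences, speaker_wav, out_dir="."):
    """Return one keyword-argument dict per language (hypothetical helper)."""
    return [
        {
            "text": text,
            "language": lang,
            "speaker_wav": speaker_wav,
            # Assumed naming scheme: <language code>_output.wav
            "file_path": str(Path(out_dir) / f"{lang}_output.wav"),
        }
        for lang, text in sentences.items()
    ]


def synthesize_all(tts, jobs):
    """Call tts.tts_to_file once per job and return the output paths."""
    for job in jobs:
        tts.tts_to_file(**job)
    return [job["file_path"] for job in jobs]
```

With the model loaded, `synthesize_all(tts, build_jobs(SENTENCES, REFERENCE_WAV))` would write one file per language, and each returned path can be played with `display(Audio(path))` as in the cell above.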

#### **Analysis Questions**
1 - What did you think about the quality of the model's outputs?
- What aspects of the output were good?
- What issues or problems did you notice?

2 - How should we evaluate the model's performance?
- What important dimensions or qualities should we focus on?

3 - What methods can we use to measure these dimensions?
- What metrics or evaluation approaches can be used?

#### **II - Voice Cloning**
Next, we will clone a variety of interesting voices and observe the results. Before each synthesized output, you can also play the reference voice sample provided to the model for imitation.

In [None]:
tts.tts_to_file(text="I forgot my umbrella, and of course, it started raining.", speaker_wav="voice_examples/morgan.mp3", language="en", file_path="morgan_output.wav")
display(Markdown("### **Morgan:**"))
display(Markdown("Reference:"))
display(Audio("voice_examples/morgan.mp3", autoplay=False))
display(Markdown("Output:"))
display(Audio("/content/morgan_output.wav", autoplay=False))

tts.tts_to_file(text="I forgot my umbrella, and of course, it started raining.", speaker_wav="voice_examples/david.mp3", language="en", file_path="david_output.wav")
display(Markdown("### **David:**"))
display(Markdown("Reference:"))
display(Audio("voice_examples/david.mp3", autoplay=False))
display(Markdown("Output:"))
display(Audio("/content/david_output.wav", autoplay=False))

tts.tts_to_file(text="I forgot my umbrella, and of course, it started raining.", speaker_wav="voice_examples/thorsten_angry.wav", language="en", file_path="thorsten_angry_output.wav")
display(Markdown("### **Thorsten Angry:**"))
display(Markdown("Reference:"))
display(Audio("voice_examples/thorsten_angry.wav", autoplay=False))
display(Markdown("Output:"))
display(Audio("/content/thorsten_angry_output.wav", autoplay=False))

tts.tts_to_file(text="I forgot my umbrella, and of course, it started raining.", speaker_wav="voice_examples/thorsten_whisper.wav", language="en", file_path="thorsten_whisper_output.wav")
display(Markdown("### **Thorsten Whisper:**"))
display(Markdown("Reference:"))
display(Audio("voice_examples/thorsten_whisper.wav", autoplay=False))
display(Markdown("Output:"))
display(Audio("/content/thorsten_whisper_output.wav", autoplay=False))

tts.tts_to_file(text="I forgot my umbrella, and of course, it started raining.", speaker_wav="voice_examples/michael.mp3", language="en", file_path="michael_output.wav")
display(Markdown("### **Michael:**"))
display(Markdown("Reference:"))
display(Audio("voice_examples/michael.mp3", autoplay=False))
display(Markdown("Output:"))
display(Audio("/content/michael_output.wav", autoplay=False))

tts.tts_to_file(text="I forgot my umbrella, and of course, it started raining.", speaker_wav="voice_examples/arnold.mp3", language="en", file_path="arnold_output.wav")
display(Markdown("### **Arnold:**"))
display(Markdown("Reference:"))
display(Audio("voice_examples/arnold.mp3", autoplay=False))
display(Markdown("Output:"))
display(Audio("/content/arnold_output.wav", autoplay=False))

tts.tts_to_file(text="I forgot my umbrella, and of course, it started raining.", speaker_wav="voice_examples/sean.mp3", language="en", file_path="sean_output.wav")
display(Markdown("### **Sean:**"))
display(Markdown("Reference:"))
display(Audio("voice_examples/sean.mp3", autoplay=False))
display(Markdown("Output:"))
display(Audio("/content/sean_output.wav", autoplay=False))

tts.tts_to_file(text="I forgot my umbrella, and of course, it started raining.", speaker_wav="voice_examples/marvin.mp3", language="en", file_path="marvin_output.wav")
display(Markdown("### **Marvin:**"))
display(Markdown("Reference:"))
display(Audio("voice_examples/marvin.mp3", autoplay=False))
display(Markdown("Output:"))
display(Audio("/content/marvin_output.wav", autoplay=False))
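
A missing or misspelled reference file would presumably only surface as an error partway through a long cell like the one above. As a rough sketch (not part of the original exercise), the hypothetical helper `plan_cloning` below checks which reference clips are present before any synthesis starts. The file names are the ones used above, and the `<name>_output.wav` naming mirrors the cell above.

```python
from pathlib import Path

# Reference clips used in the cell above.
VOICE_FILES = [
    "morgan.mp3", "david.mp3", "thorsten_angry.wav", "thorsten_whisper.wav",
    "michael.mp3", "arnold.mp3", "sean.mp3", "marvin.mp3",
]
TEXT = "I forgot my umbrella, and of course, it started raining."


def plan_cloning(voice_dir, voice_files, text=TEXT, language="en"):
    """Split the voices into runnable jobs and missing reference files.

    Returns (jobs, missing): jobs is a list of keyword-argument dicts for
    tts.tts_to_file, missing is a list of file names not found in voice_dir.
    """
    jobs, missing = [], []
    for name in voice_files:
        reference = Path(voice_dir) / name
        if reference.is_file():
            jobs.append({
                "text": text,
                "language": language,
                "speaker_wav": str(reference),
                "file_path": f"{reference.stem}_output.wav",
            })
        else:
            missing.append(name)
    return jobs, missing
```

Calling `jobs, missing = plan_cloning("voice_examples", VOICE_FILES)` and printing `missing` before looping over `jobs` with `tts.tts_to_file(**job)` makes an incomplete download visible up front.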


#### **Analysis Questions**
1 - Did the quality of the outputs vary across different cases?

2 - What aspects of each voice did the model clone well?
- What details were missing or inaccurate?

3 - How should we evaluate the model’s ability to handle multiple speakers and voice cloning?
- What key features or dimensions should we focus on?

4 - What metrics and evaluation methods can help assess these features?


### Step 4: Record and Clone your own voice

In this step, you will record your voice using the tool below and use the recording as the reference for the TTS model to clone. In each case, try to record at least 5 seconds of audio.

1. **Quiet Environment**: Start by recording yourself in a quiet environment, reading any sentence of your choice using a neutral tone.

2. **Noisy Environment**: Record the same sentence again, this time in a noisy environment (e.g., with background music or ambient sounds played from your phone).

3. **Voice Modulation**: Record the sentence in a quiet environment again, but this time modulate your voice in various ways (e.g., change your pitch, speed, or tone) to observe how this affects the synthesized outputs. You can change your voice several times within the same recording to test what the model picks up on.

Make sure to grant the browser access to your microphone when prompted. If the record button does not work right after you grant permission, rerun the cell. To record, press the record button once (the dot will turn red), speak into the microphone, and then press the button a second time to save the recording.

In [None]:
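import locale
# Workaround for a Colab locale issue: report UTF-8 as the preferred encoding
# (presumably so that the "!ffmpeg" shell command in the next cell keeps working)
locale.getpreferredencoding = lambda: "UTF-8"
# Audio-only stream from the browser microphone
camera = CameraStream(constraints={'audio': True,'video':False})
recorder = AudioRecorder(stream=camera)
# Display the recorder widget
recorder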

In [None]:
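# Save the browser recording (WebM) to disk
with open('my_recording.webm', 'wb') as f:
    f.write(recorder.audio.value)
# Convert it to a mono WAV file to use as the reference clip
!ffmpeg -i my_recording.webm -ac 1 -f wav my_recording.wav -y -hide_banner -loglevel panic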
tts.tts_to_file(text="Peter Piper picked a peck of pickled peppers. A peck of pickled peppers Peter Piper picked.", speaker_wav="my_recording.wav", language="en", file_path="my_voice_output.wav")
display(Markdown("Output:"))
display(Audio("/content/my_voice_output.wav", autoplay=False))
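
Since the instructions ask for at least 5 seconds of audio, it can help to check the length of the converted recording before cloning. The sketch below uses Python's standard `wave` module. `wav_duration_seconds` and `is_long_enough` are hypothetical helper names, and the code assumes that `my_recording.wav` is an uncompressed PCM WAV file, which is an assumption about what the `ffmpeg` call above writes.

```python
import wave

# Minimum reference length suggested in the instructions above.
MIN_SECONDS = 5.0


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        # Duration = number of audio frames / frames per second
        return wav_file.getnframes() / wav_file.getframerate()


def is_long_enough(path, min_seconds=MIN_SECONDS):
    """True if the recording lasts at least min_seconds."""
    return wav_duration_seconds(path) >= min_seconds
```

For example, `is_long_enough("my_recording.wav")` could be checked right after the conversion, and the recording repeated if it returns `False`.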

#### **Analysis Questions**
1 - Did the model clone your voice accurately?
- What features did it replicate well?
- What was missing or inaccurate?

2 - Did adding noise or music to the reference recording affect the output?

3 - How did changing or modulating your voice affect the results?

4 - What interesting observations did you make?

5 - Did you notice any strange outputs?
- Did the model produce unexpected or unrealistic results (hallucinations)?