# 🎙️ Lab 4: Building a Voice Assistant — From Speech to Intelligence and Back

---

## What You'll Build

In this lab you will construct a **complete voice assistant pipeline** — one that *listens* to a spoken question, *thinks* about it using a large language model, and *speaks* the answer back. By the end, you'll understand every link in this chain:

```
🎤 Your Voice  →  Speech-to-Text  →  LLM Reasoning  →  Text-to-Speech  →  🔊 Audio Response
```

The lab is split into two parts:

| Part | Focus | Key Idea |
|------|-------|----------|
| **Part 1** | Core pipeline (STT → LLM → TTS) | How do the three components connect? |
| **Part 2** | Neural TTS & voice cloning | What happens when we upgrade the "voice"? |


---
# Part 1: The Core Voice-Assistant Pipeline

In Part 1 we will build three independent components — **Speech-to-Text (STT)**, **LLM reasoning**, and **Text-to-Speech (TTS)** — test each one in isolation, then wire them together into a working assistant.

## 1.1 — Environment Setup

First, let's install the libraries we need. We'll use:

| Library | Role |
|---------|------|
| `librosa` / `soundfile` | Audio loading & saving |
| `gtts` | Google Text-to-Speech (simple, cloud-based) |
| `groq` | Fast API access to Whisper (STT) and Llama (LLM) |
| `python-dotenv` | Manage API keys cleanly |

In [1]:
# ── Install dependencies (run once) ──────────────────────────────────────────
%pip install librosa soundfile -q
%pip install IPython matplotlib numpy -q
%pip install gtts -q
%pip install groq -q
%pip install python-dotenv -q

print("✅ All dependencies installed.")

✅ All dependencies installed.


In [2]:
# ── Imports ──────────────────────────────────────────────────────────────────
import os, io, time, json
import numpy as np
from datetime import datetime
from IPython.display import Audio, display, HTML

import librosa
import soundfile as sf
from gtts import gTTS
from groq import Groq

print("✅ Imports successful.")

✅ Imports successful.


### 🔑 API Key Setup

We'll use the **Groq API** for both speech-to-text (Whisper) and the LLM (Llama 3.1). If you don't have a key yet, grab a free one at [console.groq.com](https://console.groq.com).

Run the cell below — it will prompt you to paste your key securely (the input is hidden).

In [3]:
import getpass

if "GROQ_API_KEY" not in os.environ:
    os.environ["GROQ_API_KEY"] = getpass.getpass("Enter your Groq API key: ")

print("✅ API key configured.")

Enter your Groq API key: ··········
✅ API key configured.


---
## 1.2 — Speech-to-Text (STT): Turning Sound into Words

The first stage of our pipeline converts an audio waveform into a text transcription. We'll use **Groq's hosted Whisper-large-v3** model, which is OpenAI's Whisper running on Groq's fast inference hardware.

**How Whisper works (in brief):**
1. The audio is converted into a log-mel spectrogram: a time-frequency image whose frequency axis follows the perceptual mel scale and whose magnitudes are log-scaled.
2. A Transformer encoder reads the spectrogram.
3. A Transformer decoder generates the transcript token by token.
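
To make step 1 more concrete, here is a minimal sketch of the framing arithmetic and the mel mapping, using only the standard library. The constants (16 kHz sample rate, 10 ms hop, 30-second windows) are the values Whisper is generally documented to use, and `hz_to_mel` is the common HTK-style formula rather than Whisper's exact filterbank, so treat everything here as an assumption and not as a reimplementation of the model's front end.

```python
import math

SAMPLE_RATE = 16_000   # assumed: Whisper resamples input audio to 16 kHz
HOP_LENGTH = 160       # assumed: 10 ms between successive spectrogram frames
CHUNK_SECONDS = 30     # assumed: audio is padded or trimmed to 30 s windows


def hz_to_mel(freq_hz: float) -> float:
    """HTK-style mel mapping (illustrative; libraries differ in the exact variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def frames_per_chunk() -> int:
    """Number of spectrogram time steps the encoder sees per window."""
    return CHUNK_SECONDS * SAMPLE_RATE // HOP_LENGTH


print(frames_per_chunk())  # 3000

# The same 1000 Hz step covers far less of the mel axis at high frequencies,
# which is why the mel scale gives more resolution to low frequencies.
low_step = hz_to_mel(1000) - hz_to_mel(0)
high_step = hz_to_mel(8000) - hz_to_mel(7000)
print(low_step > high_step)  # True
```

The point of the sketch is the shape of the input: a fixed-length grid of frames by mel bins, which is what lets a Transformer encoder treat audio much like an image.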

Let's wrap this in a small, reusable class.

In [4]:
class SpeechToTextEngine:
    """Thin wrapper around Groq's Whisper API."""

    def __init__(self):
        api_key = os.getenv("GROQ_API_KEY")
        self.client = Groq(api_key=api_key)
        print("🎤 SpeechToTextEngine ready (Whisper-large-v3 via Groq)")

    def transcribe(self, audio_path: str) -> str:
        """Transcribe an audio file and return the text."""
        with open(audio_path, "rb") as f:
            response = self.client.audio.transcriptions.create(
                file=f,
                model="whisper-large-v3",
                response_format="text",
            )
        return response.strip()

# Instantiate
stt_engine = SpeechToTextEngine()

🎤 SpeechToTextEngine ready (Whisper-large-v3 via Groq)


### Quick Test: Round-Trip Accuracy

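To verify that our STT works, we'll **synthesise** a few sentences with gTTS (text → audio) and then **transcribe** them back (audio → text). A perfect system would return the original sentence. Keep in mind that gTTS audio is clean and noise-free, so this test checks that the plumbing works rather than how well Whisper copes with real recordings.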

In [5]:
test_phrases = [
    "Hello, how are you today?",
    "What is the weather like?",
    "Tell me a joke about artificial intelligence.",
    "What can you help me with?",
]

# Generate audio files from text
print("Creating test audio files with gTTS …\n")
test_audio_files = []
for i, phrase in enumerate(test_phrases, 1):
    filename = f"test_input_{i}.mp3"
    gTTS(text=phrase, lang="en", slow=False).save(filename)
    test_audio_files.append((filename, phrase))
    print(f"  📄 {filename}")

# Transcribe each file and compare
print("\n── STT Round-Trip Results ──────────────────────────────")
for audio_file, original in test_audio_files:
    transcribed = stt_engine.transcribe(audio_file)
    match = "✅" if transcribed.lower().strip("?.!") == original.lower().strip("?.!") else "⚠️"
    print(f"  {match}  Original : {original}")
    print(f"       Whisper  : {transcribed}\n")

print("🎤 STT component verified!")

Creating test audio files with gTTS …

  📄 test_input_1.mp3
  📄 test_input_2.mp3
  📄 test_input_3.mp3
  📄 test_input_4.mp3

── STT Round-Trip Results ──────────────────────────────
  ✅  Original : Hello, how are you today?
       Whisper  : Hello, how are you today?

  ✅  Original : What is the weather like?
       Whisper  : What is the weather like?

  ✅  Original : Tell me a joke about artificial intelligence.
       Whisper  : Tell me a joke about artificial intelligence.

  ✅  Original : What can you help me with?
       Whisper  : What can you help me with?

🎤 STT component verified!
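
The ✅/⚠️ check above is all-or-nothing: one wrong word and the whole sentence fails. A common finer-grained measure is the word error rate (WER), the word-level edit distance divided by the number of reference words. The sketch below is a plain-Python version; `normalize` and `word_error_rate` are hypothetical helper names, and the normalisation (lowercase, punctuation stripped) is one reasonable choice among several.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation (keeping apostrophes), and split into words."""
    return re.sub(r"[^\w\s']", "", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("What is the weather like?", "What is the whether like?"))  # 0.2
```

In the notebook you could call `word_error_rate(original, transcribed)` inside the round-trip loop and print the score next to each pair.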


---
## 1.3 — LLM Reasoning: The "Brain" of the Assistant

Now that we can convert speech to text, we need something to *think* about what the user said. We'll send the transcript to **Llama 3.1-8B** via Groq and get a natural-language response back.

A few things to notice in the code below:
- We pass a **system prompt** that sets the assistant's persona.
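- `temperature=0.7` adds some sampling randomness, so answers vary between runs while staying on topic; lower values make the output more repeatable.
- `max_tokens=150` caps the length of each response, which matters when we later synthesise it to speech. It is a hard cutoff, not a style instruction: longer answers are simply truncated mid-sentence, as several of the outputs in this lab show. If you want answers that are both short and complete, also ask for brevity in the system prompt.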

In [6]:
class LLMEngine:
    """LLM engine backed by Groq (Llama 3.1-8B)."""

    def __init__(self, model: str = "llama-3.1-8b-instant"):
        api_key = os.getenv("GROQ_API_KEY")
        self.client = Groq(api_key=api_key)
        self.model = model
        print(f"🧠 LLMEngine ready (model: {self.model})")

    def respond(self, user_input: str) -> str:
        """Generate a short, conversational response."""
        completion = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": "You are a helpful AI assistant. Provide clear, concise, and friendly responses."},
                {"role": "user", "content": user_input},
            ],
            temperature=0.7,
            max_tokens=150,
        )
        return completion.choices[0].message.content.strip()

# Instantiate
llm_engine = LLMEngine()

🧠 LLMEngine ready (model: llama-3.1-8b-instant)
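
To build intuition for the `temperature` setting, the sketch below applies the textbook temperature-scaled softmax to a few made-up token scores. This is the standard formulation of the idea; whether the Groq backend implements sampling exactly this way is an assumption, and the `logits` values are invented for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

Low temperatures concentrate almost all probability on the top-scoring token, while higher temperatures spread it out, which is where the extra variety in the responses comes from.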


In [7]:
# Quick sanity check — send a few prompts
sample_prompts = [
    "Hello, how are you?",
    "Tell me a joke.",
    "What can you help me with?",
]

print("── LLM Test Responses ─────────────────────────────────")
for prompt in sample_prompts:
    response = llm_engine.respond(prompt)
    print(f"  👤 {prompt}")
    print(f"  🤖 {response}\n")

print("🧠 LLM component verified!")

── LLM Test Responses ─────────────────────────────────
  👤 Hello, how are you?
  🤖 Hello. I'm doing well, thanks for asking. I'm a large language model, so I don't have feelings, but I'm here and ready to help you with any questions or tasks you may have. How about you? How's your day going?

  👤 Tell me a joke.
  🤖 Here's one:

What do you call a fake noodle?

An impasta.

  👤 What can you help me with?
  🤖 I can help you with a wide range of topics and tasks. Here are some examples:

1. **Answering questions**: I can provide information on history, science, technology, literature, and more.
2. **Language translation**: I can translate text from one language to another, including popular languages like Spanish, French, German, Chinese, and many more.
3. **Writing assistance**: I can help with writing tasks like proofreading, suggesting alternative phrases, and even generating text based on a prompt.
4. **Math and calculations**: I can perform mathematical operations, from simple arit
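
The last response above stops mid-word because the `max_tokens=150` cap cut it off, and it contains Markdown markers that are meant for a screen rather than a speaker. A minimal sketch of a clean-up step that could run between the LLM and the TTS stage is shown below. `prepare_for_speech` is a hypothetical helper, not part of the lab code, and the regular expressions only cover the simple patterns seen in these outputs.

```python
import re


def prepare_for_speech(text: str) -> str:
    """Strip simple Markdown and drop a trailing incomplete sentence."""
    # Remove emphasis markers such as **word** or *word*
    text = re.sub(r"\*{1,2}([^*]+)\*{1,2}", r"\1", text)
    # Remove list numbering like "1. " at the start of a line
    text = re.sub(r"(?m)^\s*\d+\.\s+", "", text)
    # Collapse newlines and repeated spaces
    text = re.sub(r"\s+", " ", text).strip()
    # If the text does not end a sentence, cut back to the last one that does
    if text and text[-1] not in ".!?":
        last_end = max(text.rfind(mark) for mark in ".!?")
        if last_end != -1:
            text = text[: last_end + 1]
    return text


raw = (
    "Here are some examples:\n\n"
    "1. **Answering questions**: I can provide information.\n"
    "2. **Math**: I can perform operations, from simple arit"
)
print(prepare_for_speech(raw))
# Here are some examples: Answering questions: I can provide information.
```

Trimming loses the unfinished sentence, which is usually preferable to hearing the assistant stop in the middle of a word.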

---
## 1.4 — Text-to-Speech (TTS): Giving the Assistant a Voice

The final component converts the LLM's text response into audible speech. For Part 1, we use **gTTS (Google Text-to-Speech)** — it's simple, requires no GPU, and produces intelligible (if somewhat robotic) output.

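> **Design note:** gTTS does not synthesise audio itself. It is a thin client for the Google Translate text-to-speech endpoint, so synthesis happens on Google's servers: you need an internet connection, and you get no control over the voice or its style. In Part 2, we'll upgrade to a *neural* TTS model that runs inside the notebook and can take on different voices.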

In [8]:
class TextToSpeechEngine:
    """Simple TTS wrapper using gTTS."""

    def __init__(self):
        print("🔊 TextToSpeechEngine ready (gTTS)")

    def synthesize(self, text: str, filename: str = "tts_output.mp3"):
        """Convert text to speech; returns an IPython Audio object."""
        gTTS(text=text, lang="en", slow=False).save(filename)
        return Audio(filename)

# Instantiate
tts_engine = TextToSpeechEngine()

🔊 TextToSpeechEngine ready (gTTS)


In [9]:
# Test with a few sample utterances
test_utterances = [
    "Hello! I'm your voice assistant for this lab.",
    "I can help you with speech-to-text and text-to-speech.",
    "Just ask me anything you'd like to know!",
]

print("── TTS Playback Test ──────────────────────────────────")
for text in test_utterances:
    print(f"  💬 {text}")
    audio = tts_engine.synthesize(text)
    display(audio)

print("\n🔊 TTS component verified!")

── TTS Playback Test ──────────────────────────────────
  💬 Hello! I'm your voice assistant for this lab.


  💬 I can help you with speech-to-text and text-to-speech.


  💬 Just ask me anything you'd like to know!



🔊 TTS component verified!


---
## 1.5 — Wiring It All Together: The Complete Voice Assistant

Now we connect the three building blocks into a single `DemoAssistant` class. Calling `process_audio_input(file)` runs the full loop:

```
Audio file  ──▶  STT  ──▶  LLM  ──▶  TTS  ──▶  Audio response
```

The class also keeps a **conversation history** so we can review the dialogue later.

In [10]:
class DemoAssistant:
    """End-to-end voice assistant: Audio → STT → LLM → TTS → Audio."""

    def __init__(self, stt, llm, tts):
        self.stt = stt
        self.llm = llm
        self.tts = tts
        self.history = []

    # ── Core pipeline ────────────────────────────────────────────────────
    def process_audio_input(self, audio_file: str):
        """Run the full pipeline on a single audio file."""
        print(f"\n{'─'*50}")
        print(f"📂 Input: {audio_file}")

        # 1) Speech-to-Text
        user_text = self.stt.transcribe(audio_file)
        if not user_text:
            print("  ⚠️ Could not transcribe audio.")
            return None
        print(f"  🎤 STT  → \"{user_text}\"")

        # 2) LLM Reasoning
        llm_response = self.llm.respond(user_text)
        print(f"  🧠 LLM  → \"{llm_response}\"")

        # 3) Text-to-Speech
        response_audio = self.tts.synthesize(llm_response)
        print(f"  🔊 TTS  → audio generated")

        # Save to history
        self.history.append({
            "time": datetime.now().strftime("%H:%M:%S"),
            "user": user_text,
            "assistant": llm_response,
        })
        return response_audio

    # ── Demo: run through all test files ─────────────────────────────────
    def demo_conversation(self, audio_files):
        for i, (audio_file, original_text) in enumerate(audio_files, 1):
            print(f"\n🗣️  Turn {i}  (original: \"{original_text}\")")
            display(Audio(audio_file))               # play input
            response = self.process_audio_input(audio_file)
            if response:
                print("  ▶️  Response:")
                display(response)
            time.sleep(0.5)

    # ── Pretty-print history ─────────────────────────────────────────────
    def show_history(self):
        print("\n══ Conversation History ════════════════════════════════")
        for i, turn in enumerate(self.history, 1):
            print(f"  [{turn['time']}]  👤 {turn['user']}")
            print(f"             🤖 {turn['assistant']}\n")

# Build the assistant from our three components
assistant = DemoAssistant(stt_engine, llm_engine, tts_engine)
print("🤖 DemoAssistant ready!")

🤖 DemoAssistant ready!
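
Note that `DemoAssistant` only stores the history; `LLMEngine.respond` still sends just the current utterance, so every turn is answered without context. If you wanted follow-up questions to work, one option is to replay recent turns in the message list. The sketch below shows the idea; `build_messages` and `max_turns` are hypothetical names, the history entries are assumed to have the same `"user"` / `"assistant"` keys as above, and the output uses the same role/content shape that `LLMEngine.respond` already passes to the API.

```python
def build_messages(history, user_input, system_prompt, max_turns=4):
    """Assemble a chat message list from recent turns plus the new input."""
    messages = [{"role": "system", "content": system_prompt}]
    # Keep only the most recent turns so the prompt does not grow without bound
    recent = history[-max_turns:] if max_turns > 0 else []
    for turn in recent:
        messages.append({"role": "user", "content": turn["user"]})
        messages.append({"role": "assistant", "content": turn["assistant"]})
    messages.append({"role": "user", "content": user_input})
    return messages


demo_history = [
    {"time": "00:05:03", "user": "Tell me a joke.", "assistant": "What do you call a fake noodle? An impasta."},
]
for message in build_messages(demo_history, "Explain that joke.", "You are a helpful AI assistant."):
    print(f"{message['role']:>9}: {message['content']}")
```

Capping the number of replayed turns is a simple way to keep latency and token usage predictable, at the cost of the assistant forgetting older parts of the conversation.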


### 🚀 Run the Demo

The cell below feeds our four test audio files through the full pipeline. For each one you'll hear the **input** (synthesised by gTTS earlier) and then the **assistant's spoken response**.

In [11]:
assistant.demo_conversation(test_audio_files)
assistant.show_history()


🗣️  Turn 1  (original: "Hello, how are you today?")



──────────────────────────────────────────────────
📂 Input: test_input_1.mp3
  🎤 STT  → "Hello, how are you today?"
  🧠 LLM  → "I'm doing well, thank you for asking. I'm here to assist you with any questions or concerns you may have, and I'm ready to help whenever you need me. No emotions or feelings like humans do, but I'm functioning properly and ready to provide information. How can I help you today?"
  🔊 TTS  → audio generated
  ▶️  Response:



🗣️  Turn 2  (original: "What is the weather like?")



──────────────────────────────────────────────────
📂 Input: test_input_2.mp3
  🎤 STT  → "What is the weather like?"
  🧠 LLM  → "Unfortunately, I'm a large language model, I don't have real-time access to your location or current weather conditions. However, I can suggest a few options to help you find out the weather:

1. **Check online weather websites**: You can visit websites like AccuWeather, Weather.com, or the National Weather Service (NWS) to get the current weather conditions and forecast for your area.
2. **Use a mobile app**: Download a weather app on your smartphone, such as Dark Sky or Weather Underground, to get real-time updates on the weather.
3. **Ask a voice assistant**: If you have a smart speaker or virtual assistant, like Siri, Google Assistant, or Alexa, you can ask them to tell you the weather"
  🔊 TTS  → audio generated
  ▶️  Response:



🗣️  Turn 3  (original: "Tell me a joke about artificial intelligence.")



──────────────────────────────────────────────────
📂 Input: test_input_3.mp3
  🎤 STT  → "Tell me a joke about artificial intelligence."
  🧠 LLM  → "Why did the AI program go on a diet? 

Because it wanted to lose some bytes."
  🔊 TTS  → audio generated
  ▶️  Response:



🗣️  Turn 4  (original: "What can you help me with?")



──────────────────────────────────────────────────
📂 Input: test_input_4.mp3
  🎤 STT  → "What can you help me with?"
  🧠 LLM  → "I can assist you with a wide range of topics and tasks. Here are some examples of what I can help you with:

1. **Answering questions**: I can provide information on various subjects, including history, science, technology, literature, and more.
2. **Language translation**: I can translate text from one language to another, including popular languages such as Spanish, French, German, Chinese, and many others.
3. **Writing and proofreading**: I can help you with writing and proofreading tasks, including suggestions for grammar, syntax, and style.
4. **Math and calculations**: I can perform mathematical calculations, from basic arithmetic to advanced calculus and more.
5. **Conversation and chat**: I can engage in natural-sounding"
  🔊 TTS  → audio generated
  ▶️  Response:



══ Conversation History ════════════════════════════════
  [00:05:03]  👤 Hello, how are you today?
             🤖 I'm doing well, thank you for asking. I'm here to assist you with any questions or concerns you may have, and I'm ready to help whenever you need me. No emotions or feelings like humans do, but I'm functioning properly and ready to provide information. How can I help you today?

  [00:05:07]  👤 What is the weather like?
             🤖 Unfortunately, I'm a large language model, I don't have real-time access to your location or current weather conditions. However, I can suggest a few options to help you find out the weather:

1. **Check online weather websites**: You can visit websites like AccuWeather, Weather.com, or the National Weather Service (NWS) to get the current weather conditions and forecast for your area.
2. **Use a mobile app**: Download a weather app on your smartphone, such as Dark Sky or Weather Underground, to get real-time updates on the weather.
3. **Ask 

---
# ✏️ Exercise: Use Your Own Voice!

So far every input was *synthesised* audio. Now it's your turn — literally. You'll **record your voice** inside this Colab notebook, then send it through the same pipeline.

**What to do:**

1. **Run the recorder cell** below. It will ask for microphone permission and record for 10 seconds.
2. **Speak a question** (e.g., *"What is the tallest building in the world?"*).
3. **Run the pipeline cells** to transcribe your speech, get an LLM answer, and hear it spoken back.

> 💡 *Tip:* Speak clearly and not too fast. Background noise will reduce transcription accuracy.

### Step 1 — Record Your Audio

In [13]:
from IPython.display import Javascript, Audio, display
from google.colab import output
import base64

# ── JavaScript audio recorder (records for 10 seconds) ──────────────────
JS_RECORDER = """
const sleep  = time => new Promise(resolve => setTimeout(resolve, time))
const b2text = blob => new Promise(resolve => {
  let reader = new FileReader()
  reader.onloadend = () => resolve(reader.result)
  reader.readAsDataURL(blob)
})

var record = async function(){
  const stream = await navigator.mediaDevices.getUserMedia({audio: true});
  const rec = new MediaRecorder(stream);
  const data = [];
  rec.ondataavailable = e => data.push(e.data);
  rec.start();
  await sleep(10000);  // 10 seconds
  rec.stop();
  await new Promise(resolve => rec.onstop = resolve);
  let blob = new Blob(data);
  let text = await b2text(blob);
  return text;
}
"""

display(Javascript(JS_RECORDER))
print("⏺️  Recording will begin when you run the next cell.")

print("🎙️  Recording — speak now! (10 seconds)\n")
audio_data = output.eval_js("record()")

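# Decode the base64 data URL and save it.
# Note: browsers usually record WebM or Ogg rather than PCM WAV, so the
# ".wav" extension here is only a label for the saved bytes.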
binary = base64.b64decode(audio_data.split(",")[1])
with open("live_input.wav", "wb") as f:
    f.write(binary)

print("✅ Saved as live_input.wav")
print("\n▶️  Playback of your recording:")
display(Audio("live_input.wav"))

<IPython.core.display.Javascript object>

⏺️  Recording will begin when you run the next cell.
🎙️  Recording — speak now! (10 seconds)

✅ Saved as live_input.wav

▶️  Playback of your recording:
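
A file's extension does not tell you what is inside it. Browser recorders commonly produce WebM or Ogg data, and gTTS writes MP3, whatever the file is called. The sketch below guesses the container from the first few bytes of a file, using well-known signatures. `sniff_audio_format` is a hypothetical helper and only knows the four formats relevant to this lab.

```python
def sniff_audio_format(header: bytes) -> str:
    """Guess the audio container from the first bytes of a file."""
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    if header[:4] == b"\x1a\x45\xdf\xa3":
        return "webm"  # EBML signature, shared by WebM and Matroska
    if header[:4] == b"OggS":
        return "ogg"
    # MP3 files start with an ID3 tag or directly with an MPEG frame sync
    if header[:3] == b"ID3":
        return "mp3"
    if len(header) >= 2 and header[0] == 0xFF and (header[1] & 0xE0) == 0xE0:
        return "mp3"
    return "unknown"


print(sniff_audio_format(b"RIFF\x24\x00\x00\x00WAVEfmt "))  # wav
print(sniff_audio_format(b"\x1a\x45\xdf\xa3\x9f\x42\x86\x81"))  # webm

# In the notebook you could check the recording like this:
# with open("live_input.wav", "rb") as f:
#     print(sniff_audio_format(f.read(12)))
```

If the result is not `wav` and a later step insists on real WAV input, convert the file first instead of renaming it.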


### Step 2 — Process Through the Pipeline

The three sub-steps below mirror exactly what `DemoAssistant.process_audio_input` does internally. **Your task:** fill in the `TODO` lines so the pipeline runs on your recorded audio.

> *Hint: look at how the same methods are called inside `DemoAssistant` above.*

In [14]:
# ── Step 2a: Speech-to-Text ─────────────────────────────────────────────
# TODO: Transcribe "live_input.wav" using stt_engine
user_text = stt_engine.transcribe("live_input.wav")
print(f"🎤 You said: \"{user_text}\"")

🎤 You said: "Hello?"


In [15]:
# ── Step 2b: LLM Response ───────────────────────────────────────────────
# TODO: Get a response from llm_engine using the transcribed text
llm_response = llm_engine.respond(user_text)
print(f"🧠 Assistant thinks: \"{llm_response}\"")

🧠 Assistant thinks: "Hello. It's nice to meet you. Is there something I can help you with or would you like to chat?"


In [17]:
# ── Step 2c: Text-to-Speech ─────────────────────────────────────────────
# TODO: Synthesize the LLM response using tts_engine
response_audio = tts_engine.synthesize(llm_response)

print("🔊 Playing the assistant's response:")
display(response_audio)

🔊 Playing the assistant's response:


In [18]:
# ── Conversation Summary ────────────────────────────────────────────────
print("\n══ Conversation Summary ════════════════════════════════")
print(f"  👤 You said      : {user_text}")
print(f"  🤖 Assistant said: {llm_response}")
print("════════════════════════════════════════════════════════")


══ Conversation Summary ════════════════════════════════
  👤 You said      : Hello?
  🤖 Assistant said: Hello. It's nice to meet you. Is there something I can help you with or would you like to chat?
════════════════════════════════════════════════════════


---
# Part 2: Neural TTS & Voice Cloning

In Part 1 we used **gTTS**, which is fast and free but sounds robotic. In Part 2 we swap it out for **XTTS v2** (by Coqui AI) — a transformer-based, multilingual neural TTS model that can even **clone a voice** from a short reference clip.

## 2.1 — Installing the Neural TTS Model

XTTS v2 requires the `espeak-ng` phonemizer and the `coqui-tts` package. The model itself is ~1.8 GB and takes 2–3 minutes to download on Colab.

In [19]:
# ── System dependencies & Python package ─────────────────────────────────
!apt-get update -qq
!apt-get install -y -qq espeak-ng
!pip install -q coqui-tts

print("\n✅ Coqui TTS installed.")


In [20]:
from TTS.api import TTS
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"🖥️  Device: {device}")

print("\n⏳ Loading XTTS v2 model (this may take 2–3 minutes)…")
xtts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
print("✅ XTTS v2 model loaded!")

# Show available built-in speakers
print(f"\n📋 Built-in speakers ({len(xtts.speakers)} total):")
print(", ".join(xtts.speakers[:10]), "…")

🖥️  Device: cuda

⏳ Loading XTTS v2 model (this may take 2–3 minutes)…
 > You must confirm the following:
 | > "I have purchased a commercial license from Coqui: licensing@coqui.ai"
 | > "Otherwise, I agree to the terms of the non-commercial CPML: https://coqui.ai/cpml" - [y/n]
 | | > y


100%|██████████| 1.87G/1.87G [00:30<00:00, 61.3MiB/s]
4.37kiB [00:00, 6.96MiB/s]
361kiB [00:00, 46.7MiB/s]
100%|██████████| 32.0/32.0 [00:00<00:00, 71.5kiB/s]
100%|██████████| 7.75M/7.75M [00:00<00:00, 14.3MiB/s]


✅ XTTS v2 model loaded!

📋 Built-in speakers (58 total):
Claribel Dervla, Daisy Studious, Gracie Wise, Tammie Ema, Alison Dietlinde, Ana Florence, Annmarie Nele, Asya Anara, Brenda Stern, Gitta Nikolina …


In [21]:
print(TTS().list_models())

['tts_models/multilingual/multi-dataset/xtts_v2', 'tts_models/multilingual/multi-dataset/xtts_v1.1', 'tts_models/multilingual/multi-dataset/your_tts', 'tts_models/multilingual/multi-dataset/bark', 'tts_models/bg/cv/vits', 'tts_models/cs/cv/vits', 'tts_models/da/cv/vits', 'tts_models/et/cv/vits', 'tts_models/ga/cv/vits', 'tts_models/en/ek1/tacotron2', 'tts_models/en/ljspeech/tacotron2-DDC', 'tts_models/en/ljspeech/tacotron2-DDC_ph', 'tts_models/en/ljspeech/glow-tts', 'tts_models/en/ljspeech/speedy-speech', 'tts_models/en/ljspeech/tacotron2-DCA', 'tts_models/en/ljspeech/vits', 'tts_models/en/ljspeech/vits--neon', 'tts_models/en/ljspeech/fast_pitch', 'tts_models/en/ljspeech/overflow', 'tts_models/en/ljspeech/neural_hmm', 'tts_models/en/vctk/vits', 'tts_models/en/vctk/fast_pitch', 'tts_models/en/sam/tacotron-DDC', 'tts_models/en/blizzard2013/capacitron-t2-c50', 'tts_models/en/blizzard2013/capacitron-t2-c150_v2', 'tts_models/en/multi-dataset/tortoise-v2', 'tts_models/en/jenny/jenny', 'tts_m

## 2.2 — Basic Neural TTS

Let's generate a sentence with one of the built-in speakers and listen to the difference compared to gTTS.

In [22]:
# Neural TTS with a built-in speaker
xtts.tts_to_file(
    text="Hello! This is XTTS generating high-quality speech.",
    speaker_id="Gracie Wise",
    language="en",
    file_path="xtts_basic.wav",
)

print("🔊 Neural TTS output:")
display(Audio("xtts_basic.wav"))

# Compare with gTTS for the same sentence
gTTS(text="Hello! This is gTTS generating basic speech.", lang="en").save("gtts_basic.mp3")
print("\n🔊 gTTS output (for comparison):")
display(Audio("gtts_basic.mp3"))

🔊 Neural TTS output:



🔊 gTTS output (for comparison):


## 2.3 — Voice Cloning

Voice cloning lets the model **mimic a speaker's voice** using just a short reference audio clip. Here's the workflow:

1. Provide a **reference audio** (5–15 seconds of someone speaking).
2. XTTS extracts the speaker's vocal characteristics.
3. Any new text is generated in that voice.
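
Since the reference clip should be roughly 5 to 15 seconds long, it can be worth checking the duration before cloning. The sketch below does this with the standard-library `wave` module. It assumes the clip is a real PCM WAV file (`wave.open` raises an error otherwise), and `wav_duration_seconds` and `check_reference_clip` are hypothetical helper names.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file: number of frames divided by the frame rate."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_reference_clip(path: str, min_seconds: float = 5.0, max_seconds: float = 15.0) -> bool:
    """Return True if the clip length falls inside the suggested range."""
    duration = wav_duration_seconds(path)
    return min_seconds <= duration <= max_seconds


# Example usage in the notebook (the file must be genuine WAV data):
# if not check_reference_clip("reference_voice.wav"):
#     print("Reference clip is outside the 5 to 15 second range.")
```

The duration limits are parameters rather than constants so you can experiment with shorter or longer references and listen to how the clone changes.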

We'll first create a reference voice using gTTS (or you can use your own `live_input.wav` from the exercise above!).

In [23]:
# ── Create a reference voice ─────────────────────────────────────────────
reference_text = "This is my voice that will be cloned. I speak clearly and naturally."
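gTTS(text=reference_text, lang="en", slow=False).save("reference_voice.mp3")  # gTTS always writes MP3 data

# Convert to a real WAV file so the extension matches the contents
ref_audio, ref_sr = librosa.load("reference_voice.mp3", sr=None)
sf.write("reference_voice.wav", ref_audio, ref_sr)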

print("🎧 Reference voice:")
display(Audio("reference_voice.wav"))

🎧 Reference voice:


In [24]:
# ── Generate new speech with the cloned voice ────────────────────────────
clone_text = "This is the cloned voice speaking a completely different sentence."

print(f"Generating cloned speech: \"{clone_text}\"\n")
start = time.time()

xtts.tts_to_file(
    text=clone_text,
    speaker_wav="reference_voice.wav",   # swap with "live_input.wav" to clone YOUR voice!
    language="en",
    file_path="cloned_output.wav",
)
elapsed = time.time() - start
print(f"⏱️  Generated in {elapsed:.1f}s\n")

print("🎧 1. Reference voice:")
display(Audio("reference_voice.wav"))
print("\n🎧 2. Cloned voice (new sentence):")
display(Audio("cloned_output.wav"))

Generating cloned speech: "This is the cloned voice speaking a completely different sentence."

⏱️  Generated in 4.2s

🎧 1. Reference voice:



🎧 2. Cloned voice (new sentence):


## 2.4 — Plugging Neural TTS into the Assistant

Let's replace gTTS with XTTS + voice cloning in our pipeline. The `VoiceCloneAssistant` below reuses our existing STT and LLM engines but generates the response in a **cloned voice**.

In [25]:
class VoiceCloneAssistant:
    """Voice assistant that responds in a cloned voice."""

    def __init__(self, stt, llm, tts_model, reference_wav):
        self.stt = stt
        self.llm = llm
        self.tts = tts_model
        self.reference_wav = reference_wav
        print(f"🗣️  VoiceCloneAssistant ready (reference: {reference_wav})")

    def process(self, audio_file: str):
        print(f"\n{'─'*50}")
        print(f"📂 Input: {audio_file}")

        # 1) STT
        user_text = self.stt.transcribe(audio_file)
        print(f"  🎤 STT  → \"{user_text}\"")

        # 2) LLM
        response = self.llm.respond(user_text)
        print(f"  🧠 LLM  → \"{response[:80]}…\"")

        # 3) Neural TTS with cloned voice
        output_file = "cloned_response.wav"
        self.tts.tts_to_file(
            text=response,
            speaker_wav=self.reference_wav,
            language="en",
            file_path=output_file,
        )
        print(f"  🔊 TTS  → {output_file}")

        print("\n  ▶️  Response:")
        display(Audio(output_file))
        return output_file

# Initialize with the reference voice we created earlier
clone_assistant = VoiceCloneAssistant(
    stt_engine, llm_engine, xtts, "reference_voice.wav"
)

🗣️  VoiceCloneAssistant ready (reference: reference_voice.wav)


In [26]:
# ── Test the cloned-voice pipeline ───────────────────────────────────────
test_question = "What is voice cloning? Explain in one sentence."
gTTS(text=test_question, lang="en").save("clone_test_input.wav")

print(f"Test question: \"{test_question}\"")
clone_assistant.process("clone_test_input.wav")

Test question: "What is voice cloning? Explain in one sentence."

──────────────────────────────────────────────────
📂 Input: clone_test_input.wav
  🎤 STT  → "What is voice cloning? Explain in one sentence."
  🧠 LLM  → "Voice cloning is the process of creating a digital replica of a person's voice, …"
  🔊 TTS  → cloned_response.wav

  ▶️  Response:


'cloned_response.wav'

## 2.5 — Bonus: Multilingual Speech

XTTS v2 supports 17 languages in a single model. Let's hear the same welcome message in several languages.

In [27]:
multilingual_samples = [
    ("en", "English",    "Hello! Welcome to our artificial intelligence lab."),
    ("es", "Spanish",    "¡Hola! Bienvenido a nuestro laboratorio de inteligencia artificial."),
    ("fr", "French",     "Bonjour! Bienvenue dans notre laboratoire d'intelligence artificielle."),
    ("de", "German",     "Hallo! Willkommen in unserem Labor für künstliche Intelligenz."),
    ("it", "Italian",    "Ciao! Benvenuto nel nostro laboratorio di intelligenza artificiale."),
    ("pt", "Portuguese", "Olá! Bem-vindo ao nosso laboratório de inteligência artificial."),
]

print("🌍 Multilingual TTS Demo\n")
for lang_code, lang_name, text in multilingual_samples:
    print(f"  {lang_name}: \"{text}\"")
    filename = f"multilang_{lang_code}.wav"
    try:
        xtts.tts_to_file(
            text=text,
            language=lang_code,
            speaker_id="Daisy Studious",
            file_path=filename,
        )
        display(Audio(filename))
    except Exception as e:
        print(f"    ⚠️ Error: {e}")
    print()

🌍 Multilingual TTS Demo

  English: "Hello! Welcome to our artificial intelligence lab."



  Spanish: "¡Hola! Bienvenido a nuestro laboratorio de inteligencia artificial."



  French: "Bonjour! Bienvenue dans notre laboratoire d'intelligence artificielle."



  German: "Hallo! Willkommen in unserem Labor für künstliche Intelligenz."



  Italian: "Ciao! Benvenuto nel nostro laboratorio di intelligenza artificiale."



  Portuguese: "Olá! Bem-vindo ao nosso laboratório de inteligência artificial."
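
The loop above relies on `try`/`except` to catch unsupported languages at synthesis time. An alternative is to validate the codes up front. The sketch below does that against a hard-coded set; the 17 codes are listed from memory of the XTTS v2 model card, so treat the set as an assumption and verify it against the model you actually loaded. `split_supported` is a hypothetical helper.

```python
# Assumed language codes for XTTS v2 (verify against the loaded model)
XTTS_V2_LANGUAGES = {
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "hu", "ko", "ja", "hi",
}


def split_supported(samples, supported=XTTS_V2_LANGUAGES):
    """Partition (code, name, text) tuples into supported and unsupported lists."""
    ok, rejected = [], []
    for sample in samples:
        code = sample[0].lower()
        (ok if code in supported else rejected).append(sample)
    return ok, rejected


samples = [
    ("en", "English", "Hello! Welcome to our artificial intelligence lab."),
    ("sv", "Swedish", "Hej! Välkommen till vårt labb."),
]
ok, rejected = split_supported(samples)
print([name for _, name, _ in ok])        # ['English']
print([name for _, name, _ in rejected])  # ['Swedish']
```

Checking first avoids spending GPU time on requests that are certain to fail and gives a clearer message than a stack trace.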





## 2.6 — Exercise: Attempting Emotional Expression

Can we convey different emotions just by changing the text? This is an open experiment — current lightweight models have limited expressiveness, so don't expect Hollywood-level acting!

> **Reflection question:** What design choices would make emotional TTS more convincing? Think about prosody, pacing, pitch range, and what training data would be needed.

In [29]:
# Reuse the XTTS v2 model loaded in section 2.1 instead of loading a second copy
emotion_examples = {
    "Angry": "I CAN'T BELIEVE WHAT JUST HAPPENED! This is unacceptable!",
    "Shock": "I can't believe the lakers sucks so bad this season",
    "Sad": "This is unacceptable, I feel so sad right now ...",
}

print("🎭 Emotional Expression Experiment\n")
for emotion, text in emotion_examples.items():
    print(f"  {emotion}: \"{text}\"")
    output = f"emotion_{emotion.lower()}.wav"
    xtts.tts_to_file(text=text, file_path=output, speaker_id="Marcos Rudaski", language="en")
    display(Audio(output))
    print()

🎭 Emotional Expression Experiment

  Angry: "I CAN'T BELIEVE WHAT JUST HAPPENED! This is unacceptable!"



  Shock: "I can't believe the lakers sucks so bad this season"



  Sad: "This is unacceptable, I feel so sad right now ..."



